RESPONSE 

I. Status of the Claims 

Claim 1 has been cancelled without prejudice and without disclaimer. Claim 3 has been 
amended. No new claims have been added. 

Claims 3-18 are therefore presently pending in the case. 

II. Support for the Amendments and Newly Added Claims

The specification has been amended to include a new title that is more descriptive of the 
invention to which the claims are directed. Support for the new title can be found in the original title,
and throughout the specification and claims as originally filed. 

Claim 3 has been amended to reference claim 4, since claim 1 has been cancelled without 
prejudice and without disclaimer. Support for this claim can be found throughout the specification as 
originally filed. 

It will be understood that no new matter is included within the new title, or the amended claim. 

III. Title 

The Action objects to the title of the application based on the term "novel". Applicants have 
amended the title of the present application to remove the term "novel". 

Applicants request that, since the objection has been overcome, this objection be withdrawn. 

IV. Rejection of Claims 1 and 3-18 Under 35 U.S.C. § 101

The Action first rejects claims 1 and 3-18 under 35 U.S.C. § 101, as allegedly lacking a 
patentable utility. Applicants respectfully traverse.

First, while Applicants in no way agree with the Examiner's position that claim 1 lacks a 
patentable utility, as claim 1 has been cancelled entirely without prejudice and without disclaimer, the
present rejection of claim 1 under 35 U.S.C. § 101 is rendered moot. The remainder of this section
will therefore focus on claims 3-18. 

The present invention has a number of substantial and credible utilities, not the least of which 
is in forensic biology, as described in the specification, at least at page 4, lines 27-30. As described
in the specification, from page 8, line 10, through page 9, line 8, the present sequences define several
coding single nucleotide polymorphisms - specifically: a G/A polymorphism at nucleotide position
239 of SEQ ID NO:1, which results in a glutamine or arginine residue being present at the
corresponding amino acid (aa) position 80 of SEQ ID NO:2; a silent G/T polymorphism at nucleotide
position 723 of SEQ ID NO:1, both of which result in the same amino acid being present at the
corresponding aa position of SEQ ID NO:2; an A/T polymorphism at nucleotide position 766 of
SEQ ID NO:1, which results in a leucine or methionine residue being present at the corresponding aa
position 256 of SEQ ID NO:2; a silent T/C polymorphism at nucleotide position 1074 of SEQ ID
NO:1, both of which result in the same amino acid being present at the corresponding aa position of
SEQ ID NO:2; a G/A polymorphism at nucleotide position 1075 of SEQ ID NO:1, which results in
a glutamate or lysine residue being present at the corresponding aa position 359 of SEQ ID NO:2;
a G/A polymorphism at nucleotide position 1195 of SEQ ID NO:1, which results in an isoleucine or
valine residue being present at the corresponding aa position 399 of SEQ ID NO:2; a G/A
polymorphism at nucleotide position 53 of SEQ ID NO:3, which results in a glutamine or arginine
residue being present at the corresponding aa position 18 of SEQ ID NO:4; a silent G/T
polymorphism at nucleotide position 537 of SEQ ID NO:3, both of which result in the same amino
acid being present at the corresponding aa position of SEQ ID NO:4; an A/T polymorphism at
nucleotide position 580 of SEQ ID NO:3, which results in a leucine or methionine residue being
present at the corresponding aa position 194 of SEQ ID NO:4; a silent T/C polymorphism at
nucleotide position 888 of SEQ ID NO:3, both of which result in the same amino acid being present
at the corresponding aa position of SEQ ID NO:4; a G/A polymorphism at nucleotide position 889 of
SEQ ID NO:3, which results in a glutamate or lysine residue being present at the corresponding aa
position 297 of SEQ ID NO:4; and a G/A polymorphism at nucleotide position 1009 of SEQ ID
NO:3, which results in an isoleucine or valine residue being present at the corresponding aa position
337 of SEQ ID NO:4. As such polymorphisms are the basis for forensic analysis, which does not
require the identification of a specific medical condition, and is undoubtedly a "real world" utility, the
present sequences must in themselves be useful. Thus, the present claims clearly meet the requirements
of 35 U.S.C. § 101.
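
For illustration only, the nucleotide-to-amino-acid correspondence recited above can be checked with simple codon arithmetic. The sketch below assumes that the open reading frame begins at nucleotide 1 of each of SEQ ID NO:1 and SEQ ID NO:3; that assumption, and the helper name codon_index, are introduced here for the example and are not taken from the specification.

```python
# Illustrative sketch only. ASSUMPTION: the coding frame starts at
# nucleotide 1 of each nucleotide sequence, so 1-based nucleotide
# positions 1-3 encode amino acid 1, positions 4-6 encode amino acid 2, etc.

def codon_index(nt_pos: int) -> int:
    """Return the 1-based amino acid position encoded by a 1-based nucleotide position."""
    if nt_pos < 1:
        raise ValueError("nucleotide positions are 1-based")
    return (nt_pos + 2) // 3  # integer ceiling of nt_pos / 3

# (nucleotide position, amino acid position) pairs as recited in the text above
SEQ1_TO_SEQ2 = [(239, 80), (766, 256), (1075, 359), (1195, 399)]
SEQ3_TO_SEQ4 = [(53, 18), (580, 194), (889, 297), (1009, 337)]

def pairs_consistent(pairs) -> bool:
    """True if every recited amino acid position matches the codon arithmetic."""
    return all(codon_index(nt) == aa for nt, aa in pairs)

if __name__ == "__main__":
    print(pairs_consistent(SEQ1_TO_SEQ2), pairs_consistent(SEQ3_TO_SEQ4))
```

Under that assumption, each recited amino acid position agrees with the recited nucleotide position, and the positions recited for SEQ ID NO:1 and SEQ ID NO:3 differ by a constant 186 nucleotides (62 codons).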

The Examiner states that "(n)either the specification nor the art of record disclose any diseases
or conditions associated with the function or expression of the NGPCR protein, therefore, there is no
'real world' context of use" (the Action at page 4). First, Applicants point out that the disclosure of
"any diseases or conditions associated with the function or expression of the NGPCR protein" is
not the standard for patentability under 35 U.S.C. § 101 (In re Brana, 34 USPQ2d 1436 (Fed. Cir.
1995); "Brana"). Furthermore, Applicants reiterate that the use of the presently described
polymorphisms in forensic analysis does not require the identification of a specific medical condition.
One aspect of forensic analysis is to distinguish individual members of the human population from one
another based solely on the presence or absence of one or more polymorphic markers, such as the
presently described polymorphisms. As polymorphic markers such as the presently described
polymorphisms have been used in forensic analysis for decades, this is clearly a well established
technique, and as such, specific guidance does not need to be provided in the present specification, for
it has long been established that a patent need not disclose what is well known in the art (In re Wands,
8 USPQ2d 1400 (Fed. Cir. 1988)). Thus, the Examiner's argument does not support the alleged lack
of utility.

This is also not a case of a "potential" utility. Using the polymorphic markers exactly as 
described in the specification as originally filed, the skilled artisan can readily distinguish individuals from
one another. Applicants point out that in the worst case scenario, each polymorphic marker is useful
to distinguish 50% of the population (in other words, the marker being present in half of the population).
This is an inherent feature of any polymorphic marker, as the largest percentage of a population that two
polymorphic markers can define is 50% each. If a polymorphic marker is present at a level of less than
50%, then that marker is even more informative, i.e., a greater percentage of the population can be
distinguished on the basis of the marker. Nevertheless, the ability to eliminate even 50% of the
population from a forensic analysis clearly is a real world, practical utility. 
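
The arithmetic behind the preceding paragraph can be sketched as follows. This is an illustration only: it assumes a sample that carries the marker, and the combined figure further assumes that the markers are statistically independent of one another, which is not asserted in the specification and generally would not hold for closely linked positions. The function names are hypothetical.

```python
# Illustrative sketch only. ASSUMPTIONS: the forensic sample carries each
# marker, and (for the combined figure) the markers are inherited
# independently of one another.

def excluded_fraction(marker_frequency: float) -> float:
    """Fraction of the population ruled out when the sample carries a marker
    that is present in `marker_frequency` of the population."""
    if not 0.0 <= marker_frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    return 1.0 - marker_frequency

def combined_excluded_fraction(marker_frequencies) -> float:
    """Fraction ruled out by several markers, assuming independence."""
    remaining = 1.0
    for frequency in marker_frequencies:
        # Only the people who also carry this marker remain candidates.
        remaining *= 1.0 - excluded_fraction(frequency)
    return 1.0 - remaining

if __name__ == "__main__":
    print(excluded_fraction(0.5))                       # marker in half the population
    print(excluded_fraction(0.1))                       # rarer marker
    print(combined_excluded_fraction([0.5, 0.5, 0.5]))  # three independent markers
```

On these assumptions, a marker carried by half of the population rules out the other half, and a rarer marker rules out a larger share.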

Furthermore, with regard to a '"real world' context of use", Applicants point out that naturally 
occurring genetic polymorphisms such as the polymorphisms described in the specification as originally 
filed are both the basis of, and critical to, inter alia, forensic genetic analysis intended to resolve issues 
of, for example, identity or paternity. Forensic analysis based on polymorphisms such as the 
polymorphisms identified by Applicants is used to positively identify or rule out suspects in many 
criminal cases, and in identifying human remains. Paternity determination is based on polymorphisms
such as the polymorphisms identified by Applicants to positively identify or rule out individuals
suspected of fathering a particular child. What could possibly be more substantial and "real world"
than the loss of an individual's freedom or life through incarceration? What could possibly be more
substantial and "real world" than the positive identification of human remains? What could possibly
be more substantial and "real world" than the impact, both economic and emotional, that the results of
a paternity analysis have on the individuals directly and indirectly involved? These are all well known and
generally accepted uses of polymorphisms such as the polymorphisms identified by Applicants.
Without such identified polymorphisms, the skilled artisan would not be able to carry out such forensic
or paternity analyses. Therefore, as the use of the presently described polymorphic markers in forensic
analysis is clearly a "real world" and substantial utility, the presently claimed sequences meet the 
requirements of 35 U.S.C. § 101. 

The Examiner next states that "(f)urther research to identify or reasonably confirm a 'real world'
context of use is required" (the Action at page 4). First, Applicants reiterate that the use of the
presently described polymorphic markers in forensic analysis, as detailed above, requires no further
research. Thus, the presently described polymorphisms can be used to distinguish individuals from one
another in their currently available form. Second, Applicants respectfully point out that the proper
standard for meeting the requirements of 35 U.S.C. § 101 is not whether "further research" is required
to practice certain aspects of the claimed invention, but whether undue experimentation would be
required to practice the claimed invention. The widespread use of polymorphisms such as those
described by Applicants in forensic analysis every day strongly argues against such a use requiring
"undue experimentation". Applicants point out that in assessing the question of whether undue
experimentation would be required in order to practice the claimed invention, the key term is "undue",
not "experimentation". In re Angstadt and Griffin, 190 USPQ 214 (CCPA 1976). However, even
if, arguendo, further research might be required in certain aspects of the present invention, this does
not preclude a finding that the invention has utility, as set forth by the Federal Circuit's holding in Brana
(supra), which states that "pharmaceutical inventions, necessarily includes the expectation of further
research and development" (Brana at 1442-1443, emphasis added). Thus, the need for some
experimentation clearly does not render the claimed invention unpatentable. Indeed, a considerable
amount of experimentation may be permissible if such experimentation is routinely practiced in the art.
In re Angstadt and Griffin, supra; Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd.,
18 USPQ2d 1016 (Fed. Cir. 1991). Thus, the present claims clearly meet the requirements of
35 U.S.C. § 101.

Applicants respectfully point out that as the presently described polymorphisms are a part of
the family of polymorphisms that have a well-established utility, the Federal Circuit's holding in
Brana (supra) is directly on point. In Brana, the Federal Circuit admonished the United States Patent
and Trademark Office ("the USPTO") for confusing "the requirements under the law for obtaining a
patent with the requirements for obtaining government approval to market a particular drug for human
consumption". Brana at 1442. The Federal Circuit went on to state:

At issue in this case is an important question of the legal constraints on patent office
examination practice and policy. The question is, with regard to pharmaceutical
inventions, what must the applicant provide regarding the practical utility or usefulness
of the invention for which patent protection is sought. This is not a new issue; it is one
which we would have thought had been settled by case law years ago.

Brana at 1439, emphasis added. The choice of the phrase "utility or usefulness" in the foregoing
quotation is highly pertinent. The Federal Circuit is evidently using "utility" to refer to rejections under
35 U.S.C. § 101, and is using "usefulness" to refer to rejections under 35 U.S.C. § 112, first
paragraph. This is made evident in the continuing text in Brana, which explains the correlation between
35 U.S.C. §§ 101 and 112, first paragraph. The Federal Circuit concluded:

FDA approval, however, is not a prerequisite for finding a compound useful within the
meaning of the patent laws. Usefulness in patent law, and in particular in the context
of pharmaceutical inventions, necessarily includes the expectation of further research
and development. The stage at which an invention in this field becomes useful is well
before it is ready to be administered to humans. Were we to require Phase II testing
in order to prove utility, the associated costs would prevent many companies from
obtaining patent protection on promising new inventions, thereby eliminating an
incentive to pursue, through research and development, potential cures in many crucial
areas such as the treatment of cancer.

Brana at 1442-1443, citations omitted. Thus, based on the holding in Brana, the present claims meet
the requirements under 35 U.S.C. § 101 and 35 U.S.C. § 112, first paragraph (see Section V, below).

It is important to note that it has been clearly established that a statement of utility in a 
specification must be accepted absent reasons why one skilled in the art would have reason to doubt 
the objective truth of such statement. In re Langer, 503 F.2d 1380, 1391, 183 USPQ 288, 297
(CCPA, 1974; "Langer"); In re Marzocchi, 439 F.2d 220, 224, 169 USPQ 367, 370 (CCPA,
1971). As clearly set forth in Langer:

As a matter of Patent Office practice, a specification which contains a disclosure of
utility which corresponds in scope to the subject matter sought to be patented must be
taken as sufficient to satisfy the utility requirement of § 101 for the entire claimed
subject matter unless there is a reason for one skilled in the art to question the objective
truth of the statement of utility or its scope.

Langer at 297, emphasis in original. As set forth in the Manual of Patent Examining Procedure
("MPEP"), "Office personnel must provide evidence sufficient to show that the statement of asserted
utility would be considered 'false' by a person of ordinary skill in the art" (MPEP, Eighth Edition at
2100-40, emphasis added). Therefore, absent evidence from the Examiner that the presently described
polymorphic markers could not be used in forensic analysis, as the skilled artisan would readily
understand that the present polymorphic markers have utility in forensic analysis, the present claims
clearly meet the requirements of 35 U.S.C. § 101.

The Examiner states that the invention lacks a patentable utility because the specification
"does not disclose the biological role of this protein or its significance" (the Action at page 2).
Applicants disagree, as the presently claimed sequences are clearly referred to as G-protein coupled
receptors ("GPCRs"; see, at least, the specification at page 2, lines 7-10), and further, that such
GPCRs "are typically involved in transduction pathways involving G-proteins or PPG proteins"
(specification at page 2, lines 1-2). Furthermore, Applicants would like to invite the Examiner's
attention to the fact that a sequence sharing nearly 100% identity at the protein level over nearly the
full length of the claimed sequences is present in the leading scientific repository for biological sequence
data (GenBank), and has been annotated by third party scientists wholly unaffiliated with Applicants
as "Homo sapiens G-protein coupled receptor GPR111" (GenBank accession number NM_153839;
GenBank report and alignment provided in Exhibit A). Furthermore, Applicants respectfully point out
that GPR111 has been classified as an adhesion GPCR (Fredriksson et al., FEBS Lett. 531:407-414,
2002, and Bjarnadottir et al., Genomics 84:23-33, 2004; abstracts provided in Exhibit B).
Example 10 of the Revised Interim Utility Guidelines Training Materials (pages 53-55; Exhibit C),
which have been set forth by the USPTO, clearly establishes that a rejection under 35 U.S.C. § 101
as allegedly lacking a patentable utility, and under 35 U.S.C. § 112, first paragraph, as allegedly
unusable by the skilled artisan due to the alleged lack of patentable utility (see Section V, below), is not
proper when a full length sequence (such as the presently claimed sequence) has a similarity score
greater than 95% to a protein having a "well established utility". Therefore, as the present situation
tracks Example 10 of the Revised Interim Utility Guidelines Training Materials, the USPTO's own
examination guidelines clearly indicate that the present claims meet the requirements of 35 U.S.C. § 101
and 35 U.S.C. § 112, first paragraph (see Section V, below). Thus, the present rejection of claims
3-18 should be withdrawn.

The Examiner cites an article by Doerks et al. (Trends in Genetics 14:248-250, 1998) for the
proposition that sequence-to-function methods of assigning protein function are prone to errors.
However, Doerks et al. states that "utilization of family information and thus a more detailed
characterization" should lead to "simplification of update procedures for the entire families if functional
information becomes available for at least one member" (Doerks et al., page 248, paragraph bridging
columns 1 and 2, emphasis added). Applicants point out that, as detailed above, a sequence sharing
nearly 100% identity at the protein level over nearly the full length of the claimed sequences is
present in the leading scientific repository for biological sequence data (GenBank), and has been
annotated by third party scientists wholly unaffiliated with Applicants as an adhesion GPCR
(see Exhibits A and B). The adhesion GPCRs are a well-studied protein family with a large amount
of known functional information, exactly the situation that Doerks et al. suggests will "simplify" and
"avoid the pitfalls" of previous sequence-to-function methods of assigning protein function
(Doerks et al., page 248, columns 1 and 2). Thus, instead of supporting the Examiner's position
against utility, Doerks et al. actually supports Applicants' position that the presently claimed 
sequences have a substantial and credible utility. 

The Examiner next cites Brenner (Trends in Genetics 15:132-133, 1999) as teaching that
"most homologs must have different molecular and cellular functions" (the Action at page 3). However,
this statement is based on the assumption that "if there are only 1000 superfamilies in nature, then most
homologs must have different molecular and cellular functions" (Brenner, page 132, second column).
Furthermore, Brenner suggests that one of the main problems in using homology to predict function is
"an issue solvable by appropriate use of modern and accurate sequence comparison procedures"
(Brenner, page 132, second column), and in fact references an article by Altschul et al., which is the
basis for one of the "modern and accurate sequence comparison procedures" used by Applicants.
Thus, the Brenner article also does not support the alleged lack of utility.

The Examiner next cites Bork et al. (Trends in Genetics 12:425-427, 1996) as supporting
the proposition that prediction of protein function from homology information is somewhat
unpredictable, based on the "structural similarity of a small domain of the new protein to a small domain
of a known protein" (the Action at page 3). However, the Examiner's reliance on Bork et al. is based
on a faulty assumption, specifically, the assumption that Applicants' assertion that the present
sequence is a GPCR is made on the basis of structural similarity of a small domain of the new protein
to a small domain of a known protein. Applicants again would like to invite the Examiner's attention
to the fact that a sequence sharing nearly 100% identity at the protein level over nearly the full length
of the claimed sequences is present in the leading scientific repository for biological sequence data
(GenBank), and has been annotated by third party scientists wholly unaffiliated with Applicants as
an adhesion GPCR (see Exhibits A and B). Thus, Applicants' assertion that the present sequence is
a GPCR is not made on the basis of "structural similarity of a small domain of the new protein to a small
domain of a known protein", but rather vast homology over a large tract of the sequence. Thus,
Bork et al. also does not support the alleged lack of utility for the present invention.

The Examiner finally cites Yan et al. (Science 290:523-527, 2000; "Yan") to support the
alleged lack of utility. However, Yan cites only one example, two isoforms of the anhidrotic ectodermal
dysplasia (EDA) gene, where a two amino acid change converts one isoform (EDA-A1) into the
second isoform (EDA-A2). While it is true that this amino acid change results in binding to different
receptors, it is important to note that the different receptors bound by the two isoforms are in fact
related (Yan at page 523). Furthermore, the EDA-A2 receptor was correctly identified as a member
of the tumor necrosis factor receptor superfamily based solely on sequence similarity (Yan at
page 523). Thus, Yan is hardly indicative of a high level of uncertainty in assigning function based on
sequence, and thus also does not support the alleged lack of utility.

Furthermore, notwithstanding the deficiencies detailed above, with regard to the citation of art 
to support the present rejection under 35 U.S.C. § 101, Applicants first note for the record that 
scientific manuscripts from 1996, 1998, 1999, and 2000 can hardly be considered to reflect the state
of the art at the time the present application was filed. Second, and more importantly, such citations
reflect that the Examiner appears to believe that extensive structural similarity is not enough to establish
a specific utility. Applicants respectfully point out that the Examiner's position directly contradicts the
position of the USPTO itself, as set forth in Example 10 of the Revised Interim Utility Guidelines
Training Materials (see Exhibit C), which clearly establishes that structural similarity can in fact be used
to establish function, and thus establish a specific utility. Therefore, as the USPTO's own examination
guidelines clearly indicate that structural similarity can in fact be used to establish function, and thus
establish a specific utility, the present claims meet the requirements of 35 U.S.C. § 101 and
35 U.S.C. § 112, first paragraph (see Section V, below), and the present rejection of claims 3-18
should be withdrawn. 

Applicants respectfully point out that of the pharmaceutical products currently being 
marketed by the entire industry, 60% of these products target G-protein coupled receptors (Gurrath, 
Curr. Med. Chem. 8:1605-1648, 2001; abstract provided in Exhibit D). Given that more than half
of the currently marketed drugs target proteins that are structurally (7TM proteins) and functionally
(G-protein interaction) related to the presently described sequences, a preponderance of the
evidence clearly weighs in favor of Applicants' assertion that the skilled artisan would readily recognize
that the presently described sequences have a number of specific (the claimed GPCR proteins are
encoded by a specific locus on the human genome), credible, and well-established utilities in addition
to those detailed above, for example in tracking gene expression. As the specification as originally filed
details on page 10, lines 24-27, the present nucleotide sequences have utility in assessing gene
expression patterns using high-throughput DNA chips. Such "DNA chips" clearly have utility, as
evidenced by hundreds of issued U.S. Patents, as exemplified by U.S. Patent Nos. 5,445,934,
5,556,752, 5,744,305 (Exhibits E-G; submitted with the Information Disclosure Statement filed on
March 22, 2002), and U.S. Patent Nos. 5,837,832, 6,156,501 and 6,261,776 (Exhibits H-J; copies
of issued U.S. Patents not provided pursuant to requests from the USPTO). As the present sequences
are specific markers of the human genome (see below), and such specific markers are targets for the
discovery of drugs that are associated with human disease, those of skill in the art would instantly
recognize that the present nucleotide sequences would be an ideal, novel candidate for assessing gene
expression using such DNA chips. Given the widespread utility of such "gene chip" methods using
public domain gene sequence information, there can be little doubt that the use of the presently
described novel sequences would have great utility in such DNA chip applications. Clearly, 
compositions that enhance the utility of such DNA chips, such as the presently claimed nucleotide 
sequences, must in themselves be useful. 

Further evidence of the "real world" substantial utility of the present invention is provided by 
the fact that there is an entire industry established based on the use of gene sequences or fragments 
thereof in a gene chip format. Perhaps the most notable gene chip company is Affymetrix. However,
there are many companies which have, at one time or another, concentrated on the use of gene
sequences or fragments, in gene chip and non-gene chip formats, for example: Gene Logic,
ABI-Perkin-Elmer, HySeq and Incyte. In addition, one such company (Rosetta Inpharmatics) was viewed
to have such "real world" value that it was acquired by a large pharmaceutical company (Merck) for
significant sums of money (net equity value of the transaction was $620 million). The "real world"
substantial industrial utility of gene sequences or fragments would, therefore, appear to be widespread
and well established. Clearly, persons of skill in the art, as well as venture capitalists and investors,
readily recognize the utility, both scientific and commercial, of genomic data in general, and specifically
human genomic data. Billions of dollars have been invested in the human genome project, resulting in
useful genomic data (see, e.g., Venter et al., Science 291:1304, 2001; Exhibit K). The results have
been a stunning success as the utility of human genomic data has been widely recognized as a great gift
to humanity (see, e.g., Jasny and Kennedy, Science 291:1153, 2001; Exhibit L). Clearly, the
usefulness of human genomic data, such as the presently claimed nucleic acid molecules, is substantial
and credible (worthy of billions of dollars and the creation of numerous companies focused on such
information) and well-established (the utility of human genomic information has been clearly understood
for many years). Thus, the present claims clearly meet the requirements of 35 U.S.C. § 101. 

The Examiner alleges that this asserted utility is "not specific or substantial" because "(s)uch
assays can be performed with any polynucleotide" (the Action at page 5). This argument is flawed in
a number of respects. First, Applicants respectfully point out that only expressed polynucleotide
sequences can be used to track gene expression, not just "any polynucleotide". Furthermore,
expression profiling does not even require a knowledge of the function of the particular nucleic acid on
the chip - rather the gene chip indicates which DNA fragments are expressed at greater or lesser levels
in two or more particular tissue types. Skilled artisans already have used and continue to use sequences
such as Applicants' in gene chip applications without further experimentation. Second, the Examiner
appears to be confusing the requirement for a specific utility, which is the proper standard for utility
under 35 U.S.C. § 101, with a requirement for a unique utility, which is clearly an improper standard.
As clearly set forth by the Federal Circuit in Carl Zeiss Stiftung v. Renishaw PLC, 20 USPQ2d 1101
(Fed. Cir. 1991; "Carl Zeiss"):

An invention need not be the best or only way to accomplish a certain result, and it 
need only be useful to some extent and in certain applications: "[T]he fact that an 
invention has only limited utility and is only operable in certain applications is not 
grounds for finding a lack of utility." Envirotech Corp. v. Al George, Inc., 221 USPQ
473, 480 (Fed. Cir. 1984) 

Following directly from the quote above, an invention does not need to be the only way to accomplish
a certain result. Thus, the question of whether or not other nucleic acid sequences can be used to
assess gene expression patterns is completely irrelevant to the present utility inquiry. The only relevant
question in regard to meeting the standards of 35 U.S.C. § 101 is whether "any polynucleotide" can
be so used - and the clear answer to this question is an emphatic no. Importantly, the holding in
Carl Zeiss is mandatory legal authority that essentially controls the outcome of the present case. This
case, and particularly the cited quote, directly rebuts the Examiner's argument. Furthermore, the
requirement for a unique utility is clearly not the standard adopted by the USPTO. If every invention
were required to have a unique utility, the USPTO would no longer be issuing patents on batteries,
automobile tires, golf balls, golf clubs, and treatments for a variety of human diseases, such as cancer
and bacterial or viral infections, just to name a few particular examples, because examples of each of
these have already been described and patented. All batteries have the exact same utility - specifically,
to provide power. All automobile tires have the exact same utility - specifically, for use on automobiles.
All golf balls and golf clubs have the exact same utility - specifically, use in the game of golf. All cancer
treatments have the exact same utility - specifically, to treat cancer. All anti-infectious agents have the
exact same broader utility - specifically, to treat infections. However, only the briefest perusal of
virtually any issue of the Official Gazette provides numerous examples of patents being granted on each
of the above compositions every week. Additionally, if a composition needed to be unique to be
patented, the entire class and subclass system would be an effort in futility, as the class and subclass
system serves solely to group such common inventions, which would not be required if each invention
needed to have a unique utility. Thus, the present sequence clearly meets the requirements of
35 U.S.C. § 101. 

Applicants note that the Examiner correctly determines that the generic class with regard to 
the present invention is "any polynucleotide", but then attempts to narrow the generic class of the 
invention to include only those nucleic acids that are expressed in order to support an allegation that the 
claimed nucleic acids lack a "specific" utility. Applicants reiterate that not all nucleic acids are 
expressed - in fact, only 2-4% of all nucleotide sequences are expressed. Therefore, the question of 
whether the asserted utility is "specific", as opposed to "generic", has clearly been laid to rest. 
Applicants note that such redefinition of the generic class of the invention is completely improper, and 
in clear defiance of established case law. Therefore the present claims are clearly in compliance with 
35 U.S.C. § 101. 

The Examiner further discounts this assertion of utility because "the specification does not
disclose the tissues or cell types the polypeptide/mRNA are normally expressed in" (the Action at
page 5). This is simply not true. The specification as originally filed at page 5, lines 11-14, clearly
states that "(t)he human NGPCRs described for the first time herein are novel receptor proteins that
are expressed in human pituitary, testis, skeletal muscle, adipose, esophagus, cervix, pericardium, fetal
kidney, and fetal lung cells". Thus, the Examiner's argument in no way supports the allegation that the
presently claimed sequences lack a patentable utility. 

It has been well established that Applicants need only make one credible assertion of utility to 
meet the requirements of 35 U.S.C. § 101 (Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); 
In re Gottlieb, 140 USPQ 665 (CCPA 1964); In re Malachowski, 189 USPQ 432 (CCPA 1976); 
Hoffman v. Klaus, 9 USPQ2d 1657 (Bd. Pat. App. & Inter. 1988)), and, thus, any questions 
concerning whether or not the present claims meet the requirements of 35 U.S.C. § 101 should have 
been laid to rest. Nevertheless, as a further example of the utility of the presently claimed 
polynucleotides, as described in the specification at page 4, lines 19-24, the present nucleotide 
sequences have a specific utility in "identification of protein coding sequence" and "mapping a unique 
gene to a particular chromosome". The specification as originally filed, at page 4, lines 22-24, details 
that the gene encoding the presently claimed sequences is present on "human chromosome 6, see 
GENBANK accession no. AL356421". In fact, alignment of SEQ ID NO: 1 with GenBank Accession 
Number AL356421 (which is a genomic clone from human chromosome 6) shows that the human 
gene corresponding to SEQ ID NO: 1 is dispersed on 5 exons of human chromosome 6 (alignment and 
the first page from the GenBank report are presented in Exhibit M). Clearly, the present 
polynucleotide provides exquisite specificity in localizing the specific region of human chromosome 6 
that contains the gene encoding the given polynucleotide, a utility not shared by virtually any other 
nucleic acid sequences. In fact, it is this specificity that makes this particular sequence so useful. Early 
gene mapping techniques relied on methods such as Giemsa staining to identify regions of 
chromosomes. However, such techniques produced genetic maps with a resolution of only 5 to 10 
megabases, far too low to be of much help in identifying specific genes involved in disease. The skilled 
artisan readily appreciates the significant benefit afforded by markers that map a specific locus of the 
human genome, such as the present nucleic acid sequence. For further evidence in support of the 
Applicants' position, the Examiner is requested to review, for example, section 3 of Venter et al. 
(supra, at pp. 1317-1321, including Fig. 11 at pp. 1324-1325; see Exhibit K), which demonstrates 
the significance of expressed sequence information in the structural analysis of genomic data. The 
presently claimed polynucleotide sequence defines a biologically validated sequence that provides a 
unique and specific resource for mapping the genome essentially as described in the Venter et al. 
article. Thus, the present claims clearly meet the requirements of 35 U.S.C. § 101. 

Applicants reiterate that only a minor percentage (2-4%) of the genome actually encodes 
exons, which in turn encode amino acid sequences. Significantly, the claimed polynucleotide sequence 
defines how the encoded exons are actually spliced together to produce an active transcript (i.e., the 
described sequences are useful for functionally defining exon splice-junctions). As described in the 
specification as originally filed at page 4, lines 24-27, the claimed "sequences identify actual, biologically 
relevant, exon splice junctions, as opposed to those that might have been predicted bioinformatically 
from genomic sequence alone". The specification as originally filed, from page 15, line 33, through 
page 16, line 2, further details that "sequences derived from regions adjacent to the intron/exon 
boundaries of the human gene can be used to design primers for use in amplification assays to detect 
mutations within the exons, introns, splice sites (e.g., splice acceptor and/or donor sites), etc., that can 
be used in diagnostics and pharmacogenomics". Applicants respectfully submit that the practical 
scientific value of biologically validated, expressed, spliced, and polyadenylated mRNA sequences is 
readily apparent to those skilled in the relevant biological and biochemical arts. Thus, the present 
sequence clearly meets the requirements of 35 U.S.C. § 101. 

Once again, the Examiner alleges that this asserted utility is "not specific or substantial" 
because "(s)uch assays can be performed with any polynucleotide" (the Action at page 5). With 
respect to the presently asserted utility, this argument is once again flawed in a number of respects. 
First, Applicants once again point out that only expressed sequences can be used in the identification 
of coding sequence, not just "any polynucleotide". Second, Applicants reiterate that the requirements 
of a specific utility, which is the proper standard for utility under 35 U.S.C. § 101, should not be 
confused with the requirement for a unique utility, which is clearly an improper standard (Carl Zeiss, 
supra). The fact that a small number of other nucleotide sequences could be used to map the protein 
coding regions in this specific region of chromosome 6 does not mean that the use of Applicants' 
sequence to map the protein coding regions of chromosome 6 is not a specific utility. Once again, the 
question of whether or not other nucleic acid sequences can be so used is completely irrelevant to the 
present utility inquiry. The only relevant question in regard to meeting the standards of 35 U.S.C. § 101 
is whether "any polynucleotide" can be so used - and the clear answer to this question is once again 
an emphatic no. Applicants respectfully point out the Examiner is once again attempting to narrow the 
generic class of "any polynucleotide" to include only the small number of nucleic acid molecules that 
are expressed from this particular region of chromosome 6 in order to support the allegation that the 
claimed nucleic acids lack a "specific" utility. Applicants respectfully point out once again that this is 
improper under the law as well as the policy of the USPTO. Thus, the present claims clearly meet 
the requirements of 35 U.S.C. § 101. 

Lastly, the Examiner cites Brenner v. Manson (383 U.S. 519, 148 USPQ 689 (S.Ct. 1966); 
"Brenner") to support the alleged lack of utility. However, the Federal Circuit, citing Brenner, recently 
affirmed that "(t)he threshold of utility is not high: An invention is 'useful' under section 101 if it is 
capable of providing some identifiable benefit." Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 
1700 (Fed. Cir. 1999). Additionally, the Federal Circuit has stated that "(t)o violate § 101 the claimed 
device must be totally incapable of achieving a useful result." Brooktree Corp. v. Advanced Micro 
Devices Inc., 977 F.2d 1555, 1571 (Fed. Cir. 1992), emphasis added. Cross v. Iizuka (224 USPQ 
739 (Fed. Cir. 1985); "Cross") states "any utility of the claimed compounds is sufficient to satisfy 
35 U.S.C. § 101". Cross at 748, emphasis added. Indeed, the Federal Circuit has emphatically 
confirmed that "anything under the sun that is made by man" is patentable (State Street Bank & Trust 
Co. v. Signature Financial Group Inc., 47 USPQ2d 1596, 1600 (Fed. Cir. 1998), citing the U.S. 
Supreme Court's decision in Diamond vs. Chakrabarty, 206 USPQ 193 (S.Ct. 1980)). Thus, all of 
the evidence presented above and the relevant case law, including that cited by the Examiner, supports 
the Applicants' assertion that the presently claimed sequences have a patentable utility, and are thus 
fully compliant with the requirements of 35 U.S.C. § 101. 

Finally, the requirements set forth in the Action for compliance with 35 U.S.C. § 101 do not 
comply with the requirements set forth by the USPTO itself for compliance with 35 U.S.C. § 101. 
While Applicants are well aware of the new Utility Guidelines set forth by the USPTO, Applicants 
respectfully point out that the current rules and regulations regarding the examination of patent 
applications are and always have been the patent laws as set forth in 35 U.S.C. and the patent rules as set 
forth in 37 C.F.R., not the Manual of Patent Examining Procedure or particular guidelines for patent 
examination set forth by the USPTO. Furthermore, it is the job of the judiciary, not the USPTO, to 
interpret these laws and rules. Applicants are unaware of any significant recent changes in either 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal 
Circuit that is in keeping with the new Utility Guidelines set forth by the USPTO. This is underscored 
by numerous patents that have been issued over the years that claim nucleic acid fragments that do not 
comply with the new Utility Guidelines. As just a few examples of such issued U.S. Patents, the 
Examiner is invited to review U.S. Patent Nos. 5,817,479, 5,654,173, and 5,552,281 (each of which 
claims short polynucleotides; Exhibits N-P; copies of issued U.S. Patents not provided pursuant to 
requests from the USPTO), and U.S. Patent No. 6,340,583 (which includes no working examples; 
Exhibit Q; copies of issued U.S. Patents not provided pursuant to requests from the USPTO), none 
of which contain examples of the "real-world" utilities that the Examiner seems to be requiring. As 
issued U.S. Patents are presumed to meet all of the requirements for patentability, including 
35 U.S.C. §§ 101 and 112, first paragraph (see Section V, below), Applicants submit that the present 
polynucleotides must also meet the requirements of 35 U.S.C. § 101. While Applicants understand that 
each application is examined on its own merits, Applicants are unaware of any changes to 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal 
Circuit, since the issuance of these patents that render the subject matter claimed in these patents, which 
is similar to the subject matter in question in the present application, as suddenly non-statutory or failing 
to meet the requirements of 35 U.S.C. § 101. Thus, holding Applicants to a different standard of utility 
would be arbitrary and capricious, and, like other clear violations of due process, cannot stand. 

For each of the foregoing reasons, Applicants submit that as the presently claimed nucleic acid 
molecules have been shown to have a substantial, specific, credible and well-established utility, the 
rejection of claims 1 and 3-18 under 35 U.S.C. § 101 has been overcome, and request that the 
rejection be withdrawn. 

V. Rejection of Claims 1 and 3-18 Under 35 U.S.C. § 112, First Paragraph 

The Action next rejects claims 1 and 3-18 under 35 U.S.C. § 112, first paragraph, since 
allegedly one skilled in the art would not know how to use the invention, as the invention allegedly is 
not supported by a specific, substantial, and credible utility or a well-established utility. Applicants 
respectfully traverse. 

First, while Applicants in no way agree with the Examiner's position that one skilled in the art 
would not know how to use the invention as set forth in claim 1, since claim 1 has been cancelled 
entirely without prejudice and without disclaimer, the present rejection of claim 1 under 
35 U.S.C. § 112, first paragraph is rendered moot. The remainder of this section will therefore focus 
on claims 3-18. 

Applicants submit that as claims 3-18 have been shown to have "a specific, substantial, and 
credible utility", as detailed in section IV above, the present rejection of claims 3-18 under 
35 U.S.C. § 112, first paragraph, cannot stand. 

Applicants therefore request that the rejection of claims 1 and 3-18 under 35 U.S.C. § 112, 
first paragraph, be withdrawn. 

VI. Rejection of Claims 1 and 3 Under 35 U.S.C. § 112, First Paragraph 

The Action next rejects claims 1 and 3 under 35 U.S.C. § 112, first paragraph, as allegedly not 
providing enablement for the full scope of the claimed invention. While Applicants in no way agree with 
the Examiner's position that claims 1 and 3 are not enabled for the full scope of the claim, as claim 1 
has been cancelled entirely without prejudice and without disclaimer, and claim 3 has been amended 
to reference claim 4, which is not subject to the present rejection, the present rejection of claims 1 
and 3 under 35 U.S.C. § 112, first paragraph, has been overcome. 

Applicants therefore respectfully request that the rejection of claims 1 and 3 under 
35 U.S.C. § 112, first paragraph, be withdrawn. 

VII. Rejection of Claims 1 and 3 Under 35 U.S.C. § 112, First Paragraph 

The Action next rejects claims 1 and 3 under 35 U.S.C. § 112, first paragraph, as allegedly 
containing subject matter that was not described in the specification in such a way as to reasonably 
convey to one skilled in the relevant art that the inventors, at the time the application was filed, had 
possession of the claimed invention. While Applicants in no way agree with the Examiner's position 
that claims 1 and 3 do not meet the requirements of 35 U.S.C. § 112, first paragraph, as claim 1 has 
been cancelled entirely without prejudice and without disclaimer, and claim 3 has been amended to 
reference claim 4, which is not subject to the present rejection, the present rejection of claims 1 and 3 
under 35 U.S.C. § 112, first paragraph, has been overcome. 

Applicants therefore respectfully request that the rejection of claims 1 and 3 under 
35 U.S.C. § 112, first paragraph, be withdrawn. 

VIII. Rejection of Claims 1 and 3 Under 35 U.S.C. § 102(a) 

The Action next rejects claims 1 and 3 under 35 U.S.C. § 102(a), as allegedly anticipated by 
Corby (GenBank accession number AL356421; "Corby"). While Applicants do not necessarily agree 
with the Examiner's position that claims 1 and 3 are anticipated by Corby, as claim 1 has been 
cancelled entirely without prejudice and without disclaimer, and claim 3 has been amended to reference 
claim 4, which is not subject to the present rejection, the present rejection of claims 1 and 3 under 
35 U.S.C. § 102(a) has been overcome. 

Applicants therefore respectfully request that the rejection of claims 1 and 3 under 
35 U.S.C. § 102(a) be withdrawn. 

IX. Conclusion 

The present document is a full and complete response to the Action. In conclusion, Applicants 
submit that, in light of the foregoing remarks, the present case is in condition for allowance, and such 
favorable action is respectfully requested. Should Examiner Murphy have any questions or comments, 
or believe that certain amendments of the claims might serve to improve their clarity, a telephone call 
to the undersigned Applicants' representative is earnestly solicited. 



Respectfully submitted, 

EXHIBIT A 



>NM_153839 ACCESSION:NM_153839 NID: gi 24475874 ref NM_153839.4 
Homo sapiens G protein-coupled receptor 111 (GPR111), mRNA 
Length = 2127 

Score = 1238 bits (3167), Expect = 0.0 
Identities = 614/616 (99%), Positives = 615/616 (99%) 
Frame = +1 

Query: 25   AASKSKEKVPARPHGVCDGVCTDYPQCTQPCPPDTQGNMGFSCRQKTWHKITDTCQTLNA 84 
            AASKSKEKVPARPHGVCDGVCTDY QCTQPCPPDTQGNMGFSCRQKTWHKITDTCQTLNA 
Sbjct: 277  AASKSKEKVPARPHGVCDGVCTDYSQCTQPCPPDTQGNMGFSCRQKTWHKITDTCQTLNA 456 

Query: 85   LNIFEEDSRLVQPFEDNIKISVYTGKSETITDMLLQKCPTDLSCVIRNIQQSPWIPGNIA 144 
            LNIFEEDSRLVQPFEDNIKISVYTGKSETITDMLLQKCPTDLSCVIRNIQQSPWIPGNIA 
Sbjct: 457  LNIFEEDSRLVQPFEDNIKISVYTGKSETITDMLLQKCPTDLSCVIRNIQQSPWIPGNIA 636 

Query: 145  VIVQLLHNISTAIWTGVDEAKMQSYSTIANHILNSKSISNWTFIPDRNSSYILLHSVNSF 204 
            VIVQLLHNISTAIWTGVDEAKMQSYSTIANHILNSKSISNWTFIPDRNSSYILLHSVNSF 
Sbjct: 637  VIVQLLHNISTAIWTGVDEAKMQSYSTIANHILNSKSISNWTFIPDRNSSYILLHSVNSF 816 

Query: 205  ARRLFIDKHPVDISDVFIHTMGTTISGDNIGKNFTFSMRINDTSNEVTGRVLISRDELRK 264 
            ARRLFIDKHPVDISDVFIHTMGTTISGDNIGKNFTFSMRINDTSNEVTGRVLISRDELRK 
Sbjct: 817  ARRLFIDKHPVDISDVFIHTMGTTISGDNIGKNFTFSMRINDTSNEVTGRVLISRDELRK 996 

Query: 265  VPSPSQVISIAFPTIGAILEASLLENVTVNGLVLSAILPKELKRISLIFEKISKSEERRT 324 
            VPSPSQVISIAFPTIGAILEASLLENVTVNGLVLSAILPKELKRISLIFEKISKSEERRT 
Sbjct: 997  VPSPSQVISIAFPTIGAILEASLLENVTVNGLVLSAILPKELKRISLIFEKISKSEERRT 1176 

Query: 325  QCVGWHSVENRWDQQACKMIQENSQQAVCKCRPSELFTSFSILMSPHILESLILTYITYV 384 
            QCVGWHSVENRWDQQACKMIQENSQQAVCKCRPS+LFTSFSILMSPHILESLILTYITYV 
Sbjct: 1177 QCVGWHSVENRWDQQACKMIQENSQQAVCKCRPSKLFTSFSILMSPHILESLILTYITYV 1356 

Query: 385  GLGISICSLILCLSIEVLVWSQVTKTEITYLRHVCIVNIAATLLMADVWFIVASFLSGPI 444 
            GLGISICSLILCLSIEVLVWSQVTKTEITYLRHVCIVNIAATLLMADVWFIVASFLSGPI 
Sbjct: 1357 GLGISICSLILCLSIEVLVWSQVTKTEITYLRHVCIVNIAATLLMADVWFIVASFLSGPI 1536 

Query: 445  THHKGCVAATFFVHFFYLSVFFWMLAKALLILYGIMIVFHTLPKSVLVASLFSVGYGCPL 504 
            THHKGCVAATFFVHFFYLSVFFWMLAKALLILYGIMIVFHTLPKSVLVASLFSVGYGCPL 
Sbjct: 1537 THHKGCVAATFFVHFFYLSVFFWMLAKALLILYGIMIVFHTLPKSVLVASLFSVGYGCPL 1716 

Query: 505  AIAAITVAATEPGKGYLRPEICWLNWDMTKALLAFVIPALAIVVVNLITVTLVIVKTQRA 564 
            AIAAITVAATEPGKGYLRPEICWLNWDMTKALLAFVIPALAIVVVNLITVTLVIVKTQRA 
Sbjct: 1717 AIAAITVAATEPGKGYLRPEICWLNWDMTKALLAFVIPALAIVVVNLITVTLVIVKTQRA 1896 

Query: 565  AIGNSMFQEVRAIVRISKNIAILTPLLGLTWGFGVATVIDDRSLAFHIIFSLLNAFQVSP 624 
            AIGNSMFQEVRAIVRISKNIAILTPLLGLTWGFGVATVIDDRSLAFHIIFSLLNAFQVSP 
Sbjct: 1897 AIGNSMFQEVRAIVRISKNIAILTPLLGLTWGFGVATVIDDRSLAFHIIFSLLNAFQVSP 2076 

Query: 625  DASDQVQSERIHEDVL 640 
            DASDQVQSERIHEDVL 
Sbjct: 2077 DASDQVQSERIHEDVL 2124 

LOCUS       NM_153839               2127 bp    mRNA    linear   PRI 18-DEC-2004 
DEFINITION  Homo sapiens G protein-coupled receptor 111 (GPR111), mRNA. 
ACCESSION   NM_153839 
VERSION     NM_153839.1  GI:24475874 
KEYWORDS 
SOURCE      Homo sapiens (human) 
  ORGANISM  Homo sapiens 
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE   1  (bases 1 to 2127) 
  AUTHORS   Fredriksson,R., Lagerstrom,M.C., Hoglund,P.J. and Schioth,H.B. 
  TITLE     Novel human G protein-coupled receptors with long N-terminals 
            containing GPS domains and Ser/Thr-rich regions 
  JOURNAL   FEBS Lett. 531 (3), 407-414 (2002) 
  MEDLINE   22323027 
   PUBMED   12435584 
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final 
            NCBI review. The reference sequence was derived from AY140953.1. 
FEATURES             Location/Qualifiers 
     source          1..2127 
                     /organism="Homo sapiens" 
                     /mol_type="mRNA" 
                     /db_xref="taxon:9606" 
                     /chromosome="6" 
                     /map="6p12.3" 
     gene            1..2127 
                     /gene="GPR111" 
                     /note="synonyms: PGR20, hGPCR35" 
                     /db_xref="GeneID:222611" 
                     /db_xref="LocusID:222611" 
     CDS             1..2127 
                     /gene="GPR111" 
                     /note="G protein-coupled receptor PGR20; 
                     go_component: membrane [goid 0016020] [evidence IEA]; 
                     go_function: receptor activity [goid 0004872] [evidence 
                     IEA]; 
                     go_function: G-protein coupled receptor activity [goid 
                     0004930] [evidence IEA]; 
                     go_process: neuropeptide signaling pathway [goid 0007218] 
                     [evidence IEA]" 
                     /codon_start=1 
                     /product="G-protein coupled receptor 111" 
                     /protein_id="NP_722581.1" 
                     /db_xref="GI:24475875" 
                     /db_xref="GeneID:222611" 
                     /db_xref="LocusID:222611" 
                     /translation="MGLTAYGNRRVQPGELPFGANLTLIHTRAQPVICSKLLLTKRVS 
                     PISFFLSKFQNSWGEDGWVQLDQLPSPNAVSSDQVHCSAGCTHRKCGWAASKSKEKVP 
                     ARPHGVCDGVCTDYSQCTQPCPPDTQGNMGFSCRQKTWHKITDTCQTLNALNIFEEDS 
                     RLVQPFEDNIKISVYTGKSETITDMLLQKCPTDLSCVIRNIQQSPWIPGNIAVIVQLL 
                     HNISTAIWTGVDEAKMQSYSTIANHILNSKSISNWTFIPDRNSSYILLHSVNSFARRL 
                     FIDKHPVDISDVFIHTMGTTISGDNIGKNFTFSMRINDTSNEVTGRVLISRDELRKVP 
                     SPSQVISIAFPTIGAILEASLLENVTVNGLVLSAILPKELKRISLIFEKISKSEERRT 
                     QCVGWHSVENRWDQQACKMIQENSQQAVCKCRPSKLFTSFSILMSPHILESLILTYIT 
                     YVGLGISICSLILCLSIEVLVWSQVTKTEITYLRHVCIVNIAATLLMADVWFIVASFL 
                     SGPITHHKGCVAATFFVHFFYLSVFFWMLAKALLILYGIMIVFHTLPKSVLVASLFSV 
                     GYGCPLAIAAITVAATEPGKGYLRPEICWLNWDMTKALLAFVIPALAIVVVNLITVTL 
                     VIVKTQRAAIGNSMFQEVRAIVRISKNIAILTPLLGLTWGFGVATVIDDRSLAFHIIF 
                     SLLNAFQVSPDASDQVQSERIHEDVL" 

ORIGIN 

1 atggggctga ctgcctatgg gaaccgcagg gtccagcctg gggagctgcc attcggggct 
61 aacttgactc tcatccatac aagagcccag cctgtgattt gtagcaagct tctcttgaca 
121 aagagagtga gtcccatctc tttcttcttg tccaaatttc aaaattcctg gggtgaggat 
181 ggatgggttc agttggatca actgccatcc cctaatgcag tcagctctga ccaagtacac 
241 tgttcagctg gctgcacaca caggaaatgt ggatgggctg caagcaaaag caaggagaag 
301 gtgcctgcca ggccacacgg tgtatgcgat ggtgtctgta cagactactc ccagtgtact 
361 caaccttgcc ctccagacac tcagggaaat atggggtttt catgcaggca aaagacatgg 
421 cacaagatca ctgacacctg ccagactctt aatgccctca acatctttga ggaggattca 
481 cgtttggttc agccatttga agacaatata aaaataagtg tatatactgg aaagtctgag 
541 accataacag atatgttgct acaaaagtgt cccacagatc tgtcttgtgt aattagaaac 
601 attcagcagt ctccctggat accaggaaac attgccgtaa ttgtgcagct cttacacaac 
661 atatcaacag caatatggac aggtgttgat gaggcaaaga tgcagagtta cagcaccata 
721 gccaaccaca ttcttaacag caaaagcatc tccaactgga ctttcattcc tgacagaaac 
781 agcagctata tcctgctaca ttcagtcaac tcctttgcaa gaaggctatt catagataaa 
841 catcctgttg acatatcaga tgtcttcatt catactatgg gcaccaccat atcfcggagat 
901 aacattggaa aaaatttcac tttttctatg agaattaatg ataccagcaa tgaagtcact 
961 gggagagtgt tgatcagcag agatgaactt cggaaggtgc cttccccttc tcaggtcatc 
1021 agcattgcat ttccaactat tggggctatt ttggaagcca gtcttttgga aaatgttact 
1081 gtaaatgggc ttgtcctgtc tgccattttg cccaaggaac ttaaaagaat ctcactgatt 
1141 tttgaaaaga tcagcaagtc agaggagagg aggacacagt gtgttggctg gcactctgtg 
1201 gagaacagat gggaccagca ggcctgcaaa atgattcaag aaaactccca gcaagctgtt 
1261 tgcaaatgta ggccaagcaa attgtttacc tctttctcaa ttcttatgtc acctcacatc 
1321 ttagagagtc tgattctgac ttacatcaca tatgtaggcc tgggcatttc tatttgcagc 
1381 ctgatccttt gcttgtccat tgaggtccta gtctggagcc aagtgacaaa gacagagatc 
1441 acctatttac gccatgtgtg cattgttaac attgcagcca ctttgctgat ggcagatgtg 
1501 tggttcattg tggcttcctt tcttagtggc ccaataacac accacaaggg atgtgtggca 
1561 gccacatttt ttgttcattt cttttacctt tctgtatttt tctggatgct tgccaaggca 
1621 ctccttatcc tctatggaat catgattgtt ttccatacct tgcccaagtc agtcctggtg 
1681 gcatctctgt tttcagtggg ctatggatgc cctttggcca ttgctgccat cactgttgct 
1741 gccactgaac ctggcaaagg ctatctacga cctgagatct gctggctcaa ctgggacatg 
1801 accaaagccc tcctggcctt cgtgatccca gctttggcca tcgtggtagt aaacctgatc 
1861 acagtcacac tggtgattgt caagacccag cgagctgcca ttggcaattc catgttccag 
1921 gaagtgagag ccattgtgag aatcagcaag aacatcgcca tcctcacacc acttctggga 
1981 ctgacctggg gatttggagt agccactgtc atcgatgaca gatccctggc cttccacatt 
2041 atcttctccc tgctcaatgc attccaggta agtccagatg cttctgacca agtgcaaagt 
2101 gagagaattc atgaagatgt tctgtag 

Novel human G protein-coupled receptors with long N-terminals 
containing GPS domains and Ser/Thr-rich regions. 

Fredriksson R, Lagerstrom MC, Hoglund PJ, Schioth HB. 

Department of Neuroscience, Uppsala University, BMC, Box 593, 751 24, 
Uppsala, Sweden. 

We report eight novel members of the superfamily of human G protein-coupled 
receptors (GPCRs) found by searches in the human genome databases, termed 
GPR97, GPR110, GPR111, GPR112, GPR113, GPR114, GPR115 and 
GPR116. Phylogenetic analysis shows that these are additional members of a 
family of GPCRs with long N-termini, previously termed EGF-7TM, LNB- 
7TM, B2 or LN-7TM. Five of the receptors form their own phylogenetic cluster, 
while three others form a cluster with the previously reported HE6 and GPR56 
(TM7XN1). All the receptors have a GPS domain in their N-terminus and long 
Ser/Thr-rich regions forming mucin-like stalks. GPR113 has a hormone binding 
domain and one EGF domain. GPR112 has over 20 Ser/Thr repeats and a 
pentraxin domain. GPR116 has two immunoglobulin-like repeats and a SEA 
box. We found several human EST sequences for most of the receptors showing 
differential expression patterns, which may indicate that some of these receptors 
participate in reproductive functions while others are more likely to have a role 
in the immune system. 

The human and mouse repertoire of the adhesion family of G- 
protein-coupled receptors. 

Bjarnadottir TK, Fredriksson R, Hoglund PJ, Gloriam DE, Lagerstrom 
MC, Schioth HB. 

Department of Neuroscience, Uppsala University, BMC, Box 593, 751 24, 
Uppsala, Sweden. 

The adhesion G-protein-coupled receptors (GPCRs) (also termed LN-7TM or 
EGF-7TM receptors) are membrane-bound proteins with long N-termini 
containing multiple domains. Here, 2 new human adhesion-GPCRs, termed 
GPR133 and GPR144, have been found by searches done in the human genome 
databases. Both GPR133 and GPR144 have a GPS domain in their N-termini, 
while GPR144 also has a pentraxin domain. The phylogenetic analyses of the 2 
new human receptors show that they group together without close relationship 
to the other adhesion-GPCRs. In addition to the human genes, mouse 
orthologues to those 2 and 15 other mouse orthologues to human were identified 
(GPR110, GPR111, GPR112, GPR113, GPR114, GPR115, GPR116, GPR123, 
GPR124, GPR125, GPR126, GPR128, LECl, LEC2, and LEC3). Currently the 
total number of human adhesion-GPCRs is 33. The mouse and human 
sequences show a clear one-to-one relationship, with the exception of EMR2 
and EMR3, which do not seem to have orthologues in mouse. EST expression 
charts for the entire repertoire of adhesion-GPCRs in human and mouse were 
established. Over 1600 ESTs were found for these receptors, showing 
widespread distribution in both central and peripheral tissues. The expression 
patterns are highly variable between different receptors, indicating that they 
participate in a number of physiological processes. Copyright 2003 Elsevier Inc. 

EXHIBIT C 



characterize the protein. A starting material that can only be used to produce 
a final product does not have a substantial asserted utility in those instances 
where the final product is not supported by a specific and substantial utility. 
In this case none of the proteins that are to be produced as final products 
resulting from processes involving the claimed cDNA have asserted or 
identified specific and substantial utilities. The research contemplated by 
Applicants to characterize potential protein products, especially their 
biological activities, does not constitute a specific and substantial utility. 
Identifying and studying the properties of the protein itself or the 
mechanisms in which the protein is involved does not define a "real world" 
context of use. Note, because the claimed invention is not supported by a 
specific and substantial asserted utility for the reasons set forth above, 
credibility has not been assessed. Neither the specification as filed nor any 
art of record discloses or suggests any property or activity for the cDNA 
compounds such that another non-asserted utility would be well established 
for the compounds. 

Claim 1 is also rejected under 35 U.S.C. § 112, first paragraph. 
Specifically, since the claimed invention is not supported by either a specific 
and substantial asserted utility or a well established utility for the reasons set 
forth above, one skilled in the art would not know how to use the claimed 
invention. 

Example 10: DNA Fragment Encoding a Full Open Reading Frame 
(ORF) 

Specification: The specification discloses that a cDNA library was prepared 
from human kidney epithelial cells and 5000 members of this library were 
sequenced and open reading frames were identified. The specification 
discloses a Table that indicates that one member of the library having SEQ 
ID NO: 2 has a high level of homology to a DNA ligase. The specification 
teaches that this complete ORF (SEQ ID NO: 2) encodes SEQ ID NO: 3. 
An alignment of SEQ ID NO: 3 with known amino acid sequences of DNA 
ligases indicates that there is a high level of sequence conservation between 
the various known ligases. The overall level of sequence similarity between 
SEQ ID NO: 3 and the consensus sequence of the known DNA ligases that 
are presented in the specification reveals a similarity score of 95%. A search 
of the prior art confirms that SEQ ID NO: 2 has high homology to DNA 
ligase encoding nucleic acids and that the next highest level of homology is 
to alpha-actin. However, the latter homology is only 50%. Based on the 
sequence homologies, the specification asserts that SEQ ID NO: 2 encodes a 
DNA ligase. 

Claim 1: An isolated and purified nucleic acid comprising SEQ ID NO: 2. 

Analysis: The following analysis includes the questions that need to be 
asked according to the guidelines and the answers to those questions based 
on the above facts: 

1) Based on the record, is there a "well established utility" for the 
claimed invention? Based upon applicant's disclosure and the results of the 
PTO search, there is no reason to doubt the assertion that SEQ ID NO: 2 
encodes a DNA ligase. Further, DNA ligases have a well-established use in 
the molecular biology art based on this class of protein's ability to ligate 
DNA. Consequently the answer to the question is yes. 

Note that if there is a well-established utility already associated with the 
claimed invention, the utility need not be asserted in the specification as 
filed. In order to determine whether the claimed invention has a well- 
established utility the examiner must determine that the invention has a 
specific, substantial and credible utility that would have been readily 
apparent to one of skill in the art. In this case SEQ ID NO: 2 was shown to 
encode a DNA ligase that the artisan would have recognized as having a 
specific, substantial and credible utility based on its enzymatic activity. 

Thus, the conclusion reached from this analysis is that a 35 U.S.C. § 
101 rejection and a 35 U.S.C. § 112, first paragraph, utility rejection should 
not be made. 

Example 11: Animals with Uncharacterized Human Genes 

Specification: Kidney cells from a patient with Polycystic Kidney (PCK) 
Disease have been used to make a cDNA library. From this library 8000 
nucleotide "fragments" have been sequenced but not yet used to express 
proteins in a transformed host cell nor have they been characterized in any 
other way. The 50 longest fragments, SEQ ID NO: 1-50, respectively, have 
been used to make transgenic mice. None of the 50 lines of mice have 
developed Polycystic Kidney Disease to date. The asserted utility is the use 
of the mice to research human genes from diseased human kidneys. The 
disease is inheritable, but chromosomal loci have not yet been identified. 
Neither the absence or presence of a specific protein has been identified with 
the disease condition. 

Peptide-binding G protein-coupled receptors: new opportunities 
for drug design. 

Gurrath M. 

Heinrich-Heine University, Pharmaceutical Chemistry, Universitatsstr. 1, 40225 
Dusseldorf, Germany, gurrath@pharm.uni-duesseldorf.de 

Over the last decades distinct members of the G Protein-Coupled Receptor 
(GPCR) family emerged as prominent drug targets within pharmaceutical 
research, since approximately 60 % of marketed prescription drugs act by 
selectively addressing representatives of that class of transmembrane signal 
transduction systems. It is noteworthy that the majority of GPCR-targeted drugs 
elicit their biological activity by selective agonism or antagonism of biogenic 
monoamine receptors, while the development status of peptide-binding GPCR- 
addressing compounds is still in its infancy. Exemplified on selected medicinal 
chemistry projects, this review will focus on the opportunities of therapeutic 
intervention into a broad spectrum of disease processes through agonizing or 
antagonizing the functions of peptide-binding GPCRs. In this context, a brief 
overview of GPCR-mediated signal transduction pathways will be given in 
order to emphasize the biomedical relevance of a controlled modulation of 
receptor function. Modern trends on lead finding and optimization strategies for 
peptide-binding GPCR-targeted low-molecular weight compounds will be 
highlighted on the basis of current research programs conducted in the areas of 
angiotensin II, endothelin, bradykinin, neurokinin, neuropeptide Y, LHRH, C5a 
antagonists, and somatostatin agonists, respectively. Special emphasis will be 
laid on the elaboration and utilization of structural rationales on the potential 
drug candidates, thus facilitating more detailed insights into the underlying 
molecular recognition event. 

Publication Types: 

• Review 

• Review, Tutorial 

THE HUMAN GENOME 

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of 
the human genome was generated by the whole-genome shotgun sequencing 
method. The 14.8-billion bp DNA sequence was generated over 9 months from 
27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) 
from both ends of plasmid clones made from the DNA of five individuals. Two 
assembly strategies, a whole-genome assembly and a regional chromosome 
assembly, were used, each combining sequence data from Celera and the 
publicly funded genome effort. The public data were shredded into 550-bp 
segments to create a 2.9-fold coverage of those genome regions that had been 
sequenced, without including biases inherent in the cloning and assembly 
procedure used by the publicly funded group. This brought the effective 
coverage in the assemblies to eightfold, reducing the number and size of gaps in 
the final assembly over what would be obtained with 5.11-fold coverage. The 
two assembly strategies yielded very similar results that largely agree with 
independent mapping data. The assemblies effectively cover the euchromatic 
regions of the human chromosomes. More than 90% of the genome is in 
scaffold assemblies of 100,000 bp or more, and 25% of the genome is in 
scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 
26,588 protein-encoding transcripts for which there was strong corroborating 
evidence and an additional ~12,000 computationally derived genes with mouse 
matches or other weak supporting evidence. Although gene-dense clusters are 
obvious, almost half the genes are dispersed in low G+C sequence separated 
by large tracts of apparently noncoding sequence. Only 1.1% of the genome 
is spanned by exons, whereas 24% is in introns, with 75% of the genome being 
intergenic DNA. Duplications of segmental blocks, ranging in size up to 
chromosomal lengths, are abundant throughout the genome and reveal a complex 
evolutionary history. Comparative genomic analysis indicates vertebrate 
expansions of genes associated with neuronal function, with tissue-specific 
developmental regulation, and with the hemostasis and immune systems. DNA 
sequence comparisons between the consensus sequence and publicly funded 
genome data provided locations of 2.1 million single-nucleotide polymorphisms 
(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 
1250 on average, but there was marked heterogeneity in the level of 
polymorphism across the genome. Less than 1% of all SNPs resulted in variation in 
proteins, but the task of determining which SNPs have functional consequences 
remains an open challenge. 



Decoding of the DNA that constitutes the 
human genome has been widely anticipated 
for the contribution it will make toward 
understanding [...] proposed in 1985 (1). In subsequent 
years, the idea met with mixed reactions in 
the scientific community (2). However, in 
1990 the Human Genome Project (HGP) was 
[...] Department of Energy [...] 15-year, 
$3 billion plan for completing 
the genome sequence. In 1998 we [...] 
sequencing facility, to determine the 
sequence [...] over a 3-year period. 
Here we report the penultimate milestone 
along the path toward that goal, a nearly 
complete sequence of the euchromatic portion 
of the human genome. [...] random 
shotgun [...] with subsequent assembly of [...] 

The modern history of DNA sequencing 
began in 1977, when Sanger reported his method 
[...] using chain-terminating nucleotide analogs 
(3). In the same year, the first human gene 
was isolated and sequenced (4). In 1986, Hood 
[...] in the Sanger sequencing method that included 
attaching fluorescent dyes to the nucleotides, 
which permitted them to be sequentially read 
by a computer. The first automated DNA 
sequencer, developed by Applied Biosystems in 
California in 1987, was shown to be [...] 
when the sequences of two genes were obtained 
with this new technology (6). From early 
sequencing of human genomic regions (7), it 
became clear that cDNA sequences (which are 
reverse-transcribed from RNA) would be 
essential to annotate and validate gene predictions 
in the human genome. These studies were the 
basis in part for the development of the 
expressed sequence tag (EST) method of gene 
identification (8), which is a [...] 
human EST sequences necessitated the 
development of new computer algorithms to analyze 
large amounts of sequence data, [...] 
The Institute for Genomic Research (TIGR), an 
algorithm [...] H. influenzae genome was 
completed by a [...] 

[...] paired-end sequences 
(also called mate pairs), derived from 
subclone libraries with distinct insert sizes and 
cloning characteristics. Paired-end sequences 
are sequences 500 to [...] from 
both ends of double-stranded DNA clones of 
prescribed lengths. [...] (18 to 20 kbp) [...] 
simultaneously map and sequence the human 
genome by means of end sequences from 
150-kbp bacterial artificial chromosomes (BACs) 
(17, 18). The end sequences spanned by 
known distances provide long-range 
continuity across the genome. A modification of the 
BAC end-sequencing (BES) method was 
applied successfully to complete chromosome 2 
from the Arabidopsis thaliana genome (19). 

In 1997, Weber and Myers (20) proposed 
whole-genome shotgun sequencing of the 
human genome. Their proposal was not well 
received (21). However, by early 1998, as 
less than 5% of the genome had been 
sequenced, it was clear that the rate of progress 
in human genome sequencing worldwide 
was very slow (22), and the prospects for 
finishing the genome by the 2005 goal were 
uncertain. 

In early 1998, PE Biosystems (now Applied 
Biosystems) developed an automated, 
high-throughput capillary DNA sequencer, 
subsequently called the ABI PRISM 3700 DNA 
Analyzer. Discussions between PE Biosystems 
and TIGR scientists resulted in a plan to 
undertake the sequencing of the human genome with 
the 3700 DNA Analyzer and the whole-genome 
shotgun sequencing techniques developed at 
TIGR (23). Many of the principles of operation 
of a genome-sequencing facility were 
established in the TIGR facility (24). However, the 
facility envisioned for Celera would have a 
capacity roughly 50 times that of TIGR, and 
thus new developments were required for 
sample preparation and tracking and for 
whole-genome assembly. Some argued that the 
required 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome 
assembly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the 
nucleotide sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was 
determined over a 1-year period (26-28). The 
Drosophila genome-sequencing effort resulted in 
two key findings: (i) that the assembly 
algorithms could generate chromosome assemblies 
with highly accurate order and orientation with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic 
changes in the public genome effort subsequent 
to the formation of Celera (29), led to a 
modified whole-genome shotgun sequencing 
approach to the human genome. We initially 
proposed to do 10-fold sequence coverage of the 
genome over a 3-year period and to make 
interim assembled sequence data available 
quarterly. The modifications included a plan to 
perform random shotgun sequencing to ~5-fold 
coverage and to use the unordered and 
unoriented BAC sequence fragments and 
subassemblies published in GenBank by the publicly 
funded genome effort (30) to accelerate the 
project. We also abandoned the quarterly 
announcements in the absence of interim 
assemblies to report. 

Although this strategy provided a reasonable 
result very early that was consistent with a 
whole-genome shotgun assembly with eightfold 
coverage, the human genome sequence is 
not as finished as the Drosophila genome was 
with an effective 13-fold coverage. However, it 
became clear that even with this reduced 
coverage strategy, Celera could generate an 
accurately ordered and oriented scaffold sequence of 
the human genome in less than 1 year. Human 
genome sequencing was initiated 8 September 
1999 and completed 17 June 2000. The first 
assembly was completed 25 June 2000, and the 
assembly reported here was completed 1 October 
2000. Here we describe the whole-genome 
random shotgun sequencing effort applied to 
the human genome. We developed two different 
assembly approaches for assembling the ~3 
billion bp that make up the 23 pairs of 
chromosomes of the Homo sapiens genome. Any 
GenBank-derived data were shredded to remove 
potential bias to the final sequence from 
chimeric clones, foreign DNA contamination, or 
misassembled contigs. Insofar as a correctly 
and accurately assembled genome sequence 
with faithful order and orientation of contigs 
is essential for an accurate analysis of the 
human genetic code, we have devoted a 
considerable portion of this manuscript to the 
documentation of the quality of our 
reconstruction of the genome. We also describe our 
preliminary analysis of the human genetic 
code on the basis of computational methods. 
Figure 1 (see fold-out chart associated with 
this issue; files for each chromosome can be 
found in Web fig. 1 on Science Online at 
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) 
provides a graphical overview of the genome 
and the features encoded in it. The detailed 
manual curation and interpretation of the 
genome are just beginning. 

To aid the reader in locating specific 
analytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 

1 Sources of DNA and Sequencing Methods 

2 Genome Assembly Strategy and 
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome-Wide Examination of 
Sequence Variations 

7 An Overview of the Predicted Protein- 
Coding Genes in the Human Genome 

8 Conclusions 



1 Sources of DNA and Sequencing 
Methods 

Summary. This section discusses the rationale 
and ethical rules governing donor selection to 
ensure ethnic and gender diversity along with 
the methodologies for DNA extraction and 
library construction. The plasmid library 
construction is the first critical step in shotgun 
sequencing. If the DNA libraries are not 
uniform in size, nonchimeric, and do not randomly 
represent the genome, then the subsequent steps 
cannot accurately reconstruct the genome 
sequence. We used automated high-throughput 
DNA sequencing and the computational 
infrastructure to enable efficient tracking of 
enormous amounts of sequence information (27.3 
million sequence reads; 14.9 billion bp of 
sequence). Sequencing and tracking from both 
ends of plasmid clones from 2-, 10-, and 50-kbp 
libraries were essential to the computational 
reconstruction of the genome. Our evidence 
indicates that the accurate pairing rate of end 
sequences was greater than 98%. 

Various policies of the United States and the 
World Medical Association, specifically the 
Declaration of Helsinki, offer 
recommendations for conducting experiments with human 
subjects. We convened an Institutional 
Review Board (IRB) (31) that helped us 
establish the protocol for obtaining and using 
human DNA and the informed consent process 
used to enroll research volunteers for the 
DNA-sequencing studies reported here. We 
adopted several steps and procedures to 
protect the privacy rights and confidentiality of 
the research subjects (donors). These 
included a two-stage consent process, a secure 
random alphanumeric coding system for 
specimens and records, circumscribed contact with 
the subjects by researchers, and options for 
off-site contact of donors. In addition, Celera 
applied for and received a Certificate of 
Confidentiality from the Department of Health 
and Human Services. This Certificate 
authorized Celera to protect the privacy of the 
individuals who volunteered to be donors as 
provided in Section 301(d) of the Public 
Health Service Act, 42 U.S.C. 241(d). 

Celera and the IRB believed that the 
initial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds. 
Prospective donors were asked, on a voluntary 
basis, to self-designate an ethnogeographic 
category (e.g., African-American, Chinese, 
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32). 

Three basic items of information from 
each donor were recorded and linked by 
confidential code to the donated sample: age, 
sex, and self-designated ethnogeographic 
group. From females, ~130 ml of whole 
heparinized blood was collected. From males, 
~130 ml of whole, heparinized blood was 
collected, as well as five specimens of semen, 
collected over a 6-week period. Permanent 
lymphoblastoid cell lines were created by 
Epstein-Barr virus immortalization. DNA 
from five subjects was selected for genomic 
DNA sequencing: two males and three 
females (one African-American, one 
Asian-Chinese, one Hispanic-Mexican, and two 
Caucasians) (see Web fig. 2 on Science Online 
at www.sciencemag.org/cgi/content/291/5507/ 
1304/DC1). The decision of whose DNA to 
sequence was based on a complex mix of 
factors, including the goal of achieving diversity as 
well as technical issues such as the quality of 
the DNA libraries and availability of 
immortalized cell lines. 

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun 
sequencing process is preparation of high-quality 
plasmid libraries in a variety of insert sizes so that 
pairs of sequence reads (mates) are obtained, 
one read from both ends of each plasmid insert. 
High-quality libraries have an equal 
representation of all parts of the genome, a small number 
of clones without inserts, and no contamination 
from such sources as the mitochondrial genome 
and Escherichia coli genomic DNA. DNA from 
each donor was used to construct plasmid 
libraries in one or more of three size classes: 2 kbp, 10 
kbp, and 50 kbp (Table 1) (33). 

In designing the DNA-sequencing 
process we focused on developing a simple 
system that could be implemented in a robust 
and reproducible manner and monitored 
effectively (Fig. 2) (34). 

Current sequencing protocols are based on 
the dideoxy sequencing method (35), which 
typically yields only 500 to 750 bp of sequence 
per reaction. This limitation on read length has 
made monumental gains in throughput a 
prerequisite for the analysis of large eukaryotic 
genomes. We accomplished this at the Celera 
facility, which occupies about 30,000 square 
feet of laboratory space and produces sequence 
data continuously at a rate of 175,000 total 
reads per day. The DNA-sequencing facility is 
supported by a high-performance 
computational facility (36). 
The process for DNA sequencing was 
modular by design and automated. Intermodule 
sample backlogs allowed four principal 
modules to operate independently: (i) 
library transformation, plating, and colony 
picking; (ii) DNA template preparation; 
(iii) dideoxy sequencing reaction set-up 
and purification; and (iv) sequence 
determination with the ABI PRISM 3700 DNA 
Analyzer. Because the inputs and outputs 
of each module have been carefully 
matched and sample backlogs are 
continuously managed, sequencing has proceeded 
without a single day's interruption since the 
initiation of the Drosophila project in May 
1999. The ABI 3700 is a fully automated 
capillary array sequencer and as such can 
be operated with a minimal amount of 
hands-on time, currently estimated at about 
15 min per day. The capillary system also 
facilitates correct associations of 
sequencing traces with samples through the 
elimination of manual sample loading and 
lane-tracking errors associated with slab gels. 
About 65 production staff were hired and 
trained, and were rotated on a regular basis 
through the four production modules. A 
central laboratory information management 
system (LIMS) tracked all sample plates by 
unique bar code identifiers. The facility was 
supported by a quality control team that 
performed raw material and in-process testing 
and a quality assurance group with 
responsibilities including document control, 
validation, and auditing of the facility. Critical to 
the success of the scale-up was the validation 
of all software and instrumentation before 
implementation, and production-scale testing 
of any process changes. 

1.2 Trace processing 
An automated trace-processing pipeline has 
been developed to process each sequence file 
(37). After quality and vector trimming, the 
average trimmed sequence length was 543 
bp, and the sequencing accuracy was 
exponentially distributed with a mean of 99.5% 
and with less than 1 in 1000 reads being less 
than 98% accurate (26). Each trimmed 
sequence was screened for matches to 
contaminants including sequences of vector alone, E. 
coli genomic DNA, and human mitochondrial 
DNA. The entire read for any sequence 
with a significant match to a contaminant was 
discarded. A total of 713 reads matched E. 
coli genomic DNA and 2114 reads matched 
the human mitochondrial genome. 

1.3 Quality assessment and control 
The importance of the base-pair level 
accuracy of the sequence data increases as the 
size and repetitive nature of the genome to 
be sequenced increases. Each sequence 
read must be placed uniquely in the 
genome, and even a modest error rate can 
reduce the effectiveness of assembly. In 
addition, maintaining the validity of 
mate-pair information is absolutely critical for 
the algorithms described below. Procedural 
controls were established for maintaining 
the validity of sequence mate-pairs as 
sequencing reactions proceeded through the 
process, including strict rules built into the 
LIMS. The accuracy of sequence data 
produced by the Celera process was validated 
in the course of the Drosophila genome 
project (26). By collecting data for the 
entire human genome in a single facility, 
we were able to ensure uniform quality 
standards and the cost advantages 
associated with automation, an economy of scale, 
and process consistency. 

Table 1. Celera-generated data input into assembly. 

                                      Number of reads for different insert libraries      Total number of 
                         Individual   2 kbp         10 kbp        50 kbp       Total          base pairs 
No. of sequencing reads  A                     0             0    2,767,357    2,767,357    1,502,674,851 
                         B            11,736,757     7,467,755       66,930   19,271,442   10,464,393,006 
                         C               853,819       881,290            0    1,735,109      942,164,187 
                         D               952,523     1,046,815            0    1,999,338    1,085,640,534 
                         F                     0     1,498,607            0    1,498,607      813,743,601 
                         Total        13,543,099    10,894,467    2,834,287   27,271,853   14,808,616,179 
Fold sequence coverage   A                  0      [illegible]  [illegible]         0.52 
(2.9-Gb genome)          B                  2.20   [illegible]  [illegible]         3.61 
                         C                  0.16   [illegible]  [illegible]         0.32 
                         D                  0.18   [illegible]  [illegible]         0.37 
                         F                  0      [illegible]  [illegible]         0.28 
                         Total       [illegible]   [illegible]  [illegible]         5.11 
Fold clone coverage      A, B, C, D, F, Total: [illegible] 
Insert size* (mean)      Average: [illegible] 
Insert size* (SD)        Average: [illegible] 
% Mates†                 Average: [illegible] 

*Insert size and SD are calculated from assembly of mates on contigs. 

2 Genome Assembly Strategy and 
Characterization 

Summary. We describe in this section the two 
approaches that we used to assemble the 
genome. One method involves the computational 
combination of all sequence reads with 
shredded data from GenBank to generate an 
independent nonbiased view of the genome. The 
second approach involves clustering all of the 
fragments to a region or chromosome on the basis 
of mapping information. The clustered data 
were then shredded and subjected to 
computational assembly. Both approaches provided 
essentially the same reconstruction of assembled 
DNA sequence with proper order and 
orientation. The second method provided slightly 
greater sequence coverage (fewer gaps) and 
was the principal sequence used for the analysis 
phase. In addition, we document the 
completeness and correctness of this assembly process 
and provide a comparison to the public genome 
sequence, which was reconstructed largely by 
an independent BAC-by-BAC approach. Our 
assemblies effectively covered the euchromatic 
regions of the human chromosomes. More than 
90% of the genome was in scaffold assemblies 
of 100,000 bp or greater, and 25% of the 
genome was in scaffolds of 10 million bp or 
larger. 

Fig. 2. Flow diagram for sequencing pipeline. Samples are 
selected and processed in compliance with standard operating 
procedures, with a focus on quality within and across departments. Each 
process has defined inputs and outputs with the capability to exchange 
samples and data with both internal and external entities according to 
defined quality guidelines. Manufacturing pipeline processes, 
quality control measures, and responsible parties are indicated and 
described further in the text. 

Shotgun sequence assembly is a classic 
example of an inverse problem: given a set 
of reads randomly sampled from a target 
sequence, reconstruct the order and the 
position of those reads in the target. Genome 
assembly algorithms developed for 
Drosophila have now been extended to assemble 
the ~25-fold larger human genome. Celera 
assemblies consist of a set of contigs that are 
ordered and oriented into scaffolds that are then 
mapped to chromosomal locations by using 
known markers. The contigs consist of a 
collection of overlapping sequence reads that 
provide a consensus reconstruction for a 
contiguous interval of the genome. Mate pairs are a 
central component of the assembly strategy. 
They are used to produce scaffolds in which the 
size of gaps between consecutive contigs is 
known with reasonable precision. This is 
accomplished by observing that a pair of reads, 
one of which is in one contig, and the other of 
which is in another, implies an orientation and 
distance between the two contigs (Fig. 3). 
Finally, our assemblies did not incorporate all 
reads into the final set of reported scaffolds. 
This set of unincorporated reads is termed 
"chaff," and typically consisted of reads from 
within highly repetitive regions, data from other 
organisms introduced through various routes as 
found in many genome projects, and data of 
poor quality or with untrimmed vector. 
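
The way a mate pair constrains the gap between two contigs can be sketched as follows. This is an illustrative sketch only, not the Celera scaffolding algorithm: the `PlacedRead` record, the `estimate_gap` function, and the numbers in the example are hypothetical, and the sketch assumes the simplest case in which the first mate lies on the forward strand of the left contig and the second mate on the reverse strand of the right contig.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    """A read placed on a contig (hypothetical record, not from the paper)."""
    contig: str
    start: int      # 0-based start on the contig
    end: int        # exclusive end on the contig
    forward: bool   # True if the read matches the contig's forward strand


def estimate_gap(left: PlacedRead, right: PlacedRead,
                 left_contig_len: int, insert_mean: float) -> float:
    """Estimate the gap between two contigs linked by one mate pair.

    The insert runs from the start of the forward mate to the end of the
    reverse mate, so: insert = (tail of left contig) + gap + (head of right).
    """
    if not (left.forward and not right.forward):
        # Other strand combinations imply that one contig must be flipped;
        # that case is left out of this sketch.
        raise ValueError("this sketch only handles forward/reverse mates")
    span_in_left = left_contig_len - left.start   # bases of insert in left contig
    span_in_right = right.end                     # bases of insert in right contig
    return insert_mean - span_in_left - span_in_right


# Hypothetical example: a 2-kbp insert linking two contigs.
left = PlacedRead("contig_A", start=4200, end=4743, forward=True)
right = PlacedRead("contig_B", start=357, end=900, forward=False)
gap = estimate_gap(left, right, left_contig_len=5000, insert_mean=2000)
print(gap)
```

A negative result would suggest that the two contigs overlap rather than being separated by a gap. In practice many mate pairs link the same two contigs, and the library's standard deviation bounds the uncertainty of each estimate.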



2.1 Assembly data sets 
We used two independent sets of data for our 
assemblies. The first was a random shotgun 
data set of 27.27 million reads of average length 
543 bp produced at Celera. This consisted 
largely of mate-pair reads from 16 libraries 
constructed from DNA samples taken from five 
different donors. Libraries with insert sizes of 2, 
10, and 50 kbp were used. By looking at how 
mate pairs from a library were positioned in 
known sequenced stretches of the genome, we 
were able to characterize the range of insert 
sizes in each library and determine a mean and 
standard deviation. Table 1 details the number 
of reads, sequencing coverage, and clone 
coverage achieved by the data set. The clone 
coverage is the coverage of the genome in cloned 
DNA, considering the entire insert of each 
clone that has sequence from both ends. The 
clone coverage provides a measure of the 
amount of physical DNA coverage of the 
genome. Assuming a genome size of 2.9 Gbp, the 
Celera trimmed sequences gave a 5.1X 
coverage of the genome, and clone coverage was 
3.42X, 16.40X, and 18.84X for the 2-, 10-, and 
50-kbp libraries, respectively, for a total of 
38.7X clone coverage. 
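
The coverage figures quoted above follow from simple arithmetic on the Table 1 totals. The sketch below is a hedged illustration: the variable and function names are invented for this example, the per-donor base-pair totals are taken from Table 1, and the genome size of 2.9 Gbp is the assumption stated in the text.

```python
GENOME_SIZE_BP = 2.9e9  # genome size assumed in the text

# Total base pairs per donor, from Table 1.
BASE_PAIRS = {
    "A": 1_502_674_851,
    "B": 10_464_393_006,
    "C": 942_164_187,
    "D": 1_085_640_534,
    "F": 813_743_601,
}
TOTAL_READS = 27_271_853  # total sequencing reads, from Table 1


def fold_coverage(total_bp: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Fold sequence coverage: sequenced bases divided by genome size."""
    return total_bp / genome_size_bp


total_bp = sum(BASE_PAIRS.values())
sequence_coverage = fold_coverage(total_bp)
mean_read_length = total_bp / TOTAL_READS

# Clone coverage per library (2, 10, and 50 kbp) as quoted in the text.
clone_coverage_total = 3.42 + 16.40 + 18.84

print(f"sequence coverage: {sequence_coverage:.2f}X")
print(f"mean trimmed read length: {mean_read_length:.0f} bp")
print(f"total clone coverage: {clone_coverage_total:.1f}X")
```

The per-library clone coverage values are taken as given because the number of mated clones and the insert sizes needed to recompute them are not legible in this copy of Table 1.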

The second data set was from the publicly 
funded Human Genome Project (PFP) and is 
primarily derived from BAC clones (30). The 
BAC data input to the assemblies came from a 
download of GenBank on 1 September 2000 
(Table 2) totaling 4443.3 Mbp of sequence. 
The data for each BAC is deposited at one of 
four levels of completion. Phase 0 data are a set 
of generally unassembled sequencing reads 
from a very light shotgun of the BAC, typically 
less than 1X. Phase 1 data are unordered 
assemblies of contigs, which we call BAC contigs 
or bactigs. Phase 2 data are ordered assemblies 
of bactigs. Phase 3 data are complete BAC 
sequences. In the past 2 years the PFP has 
focused on a product of lower quality and 
completeness, but on a faster time-course, by 
concentrating on the production of Phase 1 data 
from a 3X to 4X light-shotgun of each BAC 
clone. 

We screened the bactig sequences for 
contaminants by using the BLAST algorithm 
against three data sets: (i) vector sequences 
in UniVec Core (38), filtered for a 25-bp 
match at 98% sequence identity at the ends 
of the sequence and a 30-bp match internal 
to the sequence; (ii) the nonhuman portion 
of the High Throughput Genomic (HTG) 
Sequences division of GenBank (39), 
filtered at 200 bp at 98%; and (iii) the 
nonredundant nucleotide sequences from 
GenBank without primate and human virus 
entries, filtered at 200 bp at 98%. Whenever 
25 bp or more of vector was found within 
50 bp of the end of a contig, the tip up to 
the matching vector was excised. Under 
these criteria we removed 2.6 Mbp of 
possible contaminant and vector from the 
Phase 3 data, 61.0 Mbp from the Phase 1 
and 2 data, and 16.1 Mbp from the Phase 0 
data (Table 2). This left us with a total of 
4363.7 Mbp of PFP sequence data: 20% 
finished, 75% rough-draft (Phase 1 and 2), 
and 5% single sequencing reads (Phase 0). 
An additional 104,018 BAC end-sequence 
mate pairs were also downloaded and 
included in the data sets for both assembly 
processes (18). 
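
The tip-excision rule described above (a vector match of 25 bp or more lying within 50 bp of a contig end) can be sketched as below. This is a minimal sketch under stated assumptions, not the screening code used for the assemblies: the function name and the `(start, end)` hit format are hypothetical, the hits are assumed to come from a separate alignment step such as BLAST, and the sketch assumes that the matching vector itself is removed together with the tip.

```python
from typing import List, Tuple


def trim_vector_tips(contig: str, vector_hits: List[Tuple[int, int]],
                     min_match: int = 25, end_window: int = 50) -> str:
    """Excise contig tips that end in a vector match.

    vector_hits holds half-open (start, end) intervals on the contig.
    A hit counts only if it is at least min_match long and lies within
    end_window bp of either end of the contig.
    """
    left, right = 0, len(contig)
    for start, end in vector_hits:
        if end - start < min_match:
            continue                      # too short to count as vector
        if start < end_window:            # hit near the left end
            left = max(left, end)
        elif len(contig) - end < end_window:  # hit near the right end
            right = min(right, start)
        # hits internal to the contig are left alone in this sketch
    return contig[left:right]


# Hypothetical contig: 10 bp tip, 30 bp of vector, 200 bp of insert.
contig = "A" * 10 + "V" * 30 + "C" * 200
trimmed = trim_vector_tips(contig, [(10, 40)])
print(len(trimmed))
```

Internal matches would need a separate policy, such as masking; the text describes only the treatment of matches near contig ends.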

2.2 Assembly strategies 
Two different approaches to assembly were 
pursued. The first was a whole-genome 
assembly process that used Celera data and the 
PFP data in the form of additional synthetic 
shotgun data, and the second was a 
compartmentalized assembly process that first 
partitioned the Celera and PFP data into sets 
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on 
each set. Figure 4 gives a schematic of the 
overall process flow. 

For the whole-genome assembly, the PFP 
data was first disassembled or "shredded" into a 
synthetic shotgun data set of 550-bp reads that 
form a perfect 2X covering of the bactigs. This 
resulted in 16.05 million faux reads that were 
sufficient to cover the genome 2.96X because 
of redundancy in the BAC data set, without 
incorporating the biases inherent in the PFP 
assembly process. The combined data set of 
43.32 million reads (8X), and all associated 
mate-pair information, were then subjected to 
our whole-genome assembly algorithm to 
produce a reconstruction of the genome. Neither 
the location of a BAC in the genome nor 
assembly of bactigs was used in this process. 
Bactigs were shredded into reads because we 
found strong evidence that 2.13% of them were 
misassembled (40). Furthermore, BAC location 
information was ignored because some BACs 
were not correctly placed on the PFP physical 
map and because we found strong evidence that 
at least 2.2% of the BACs contained sequence 
data that were not part of the given BAC (41), 
possibly as a result of sample-tracking errors 

Table 2. GenBank data input into assembly. 



Center 



Statistics 



Completion phase sequence 
0 . 1 and 2 



Whitehead Institute/ 
MIT Center for 
Genome Research. 
USA 



Washington University; 
USA 



Baylor College of 
Medicine, USA 



Production Sequencing 
Facility. DOE Joint 
Genome Institute, 
USA 



The Institute of Physical 
and Chemical 
Research (RIKEN), 
Japan 



Sanger Centre, UK 



Others* 



All centers combinedt 



Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked 

'.Average contig length (bpj • 
Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked 
(bp) 

Average contig length (bp) 
Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked 
(bp) 

Average'cbntig length (bp) 

Number of accession records 
Numbecof contigs* 
Total base pairs. ■* 
Total vector masked (bp) 
Total contaminant masked 

* t^rf . , w /K ^ 
Average contig length (opj 

Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked (bp) 
Average contig length (bp) 
Number of accession records 
Number of contigs 
Total base pairs 

Total vector masked (bp) ' 

.Idtal contaminant masked (bp), . 
-Average contig length (bp) • 
. Number of accession records 

Number of contigs 

Total base pairs 
--Total vector masked (bp) 

Total contaminant masked 
(bp) 

Average contig length (bp) 
Number of accession records 
Number of contlgis 
Total' base pairs 
Total vector masked (bp) . 
Total contaminant masked 

(bp) ; 

Average contig length (bp) 



2,825 
243,786 
194,490,158 
1,553,597 
13,654,482 

'798 

19 
2,127 
1,195,732 
21.604 
22,469 



6,533 
138,023 
1,083,848,245 
875,618 
4,417,055 



363 
363 
48.829,358 
2.202 
98,028 



'7.853 . 134,516 



3,232 
61,812 
561,171.788 
270,942 
1.476.141 



1,300 
1,300 
164.214,395 
8.287 
469.487 



562 

0 
0 
0 
0 
0 



9,079 ' 126,319 



1.626 
44,861 



363 
363 



265.547!066 49,017.104 



218,769 
1.784,700 

5,919 

2,043 
34,938 

8,680'il4 294.249.631 
• .22,644 . 162,651 
665.818 4.642,372 



135 
7.052 



1.231 

0 
. 0 
0 
0 
0 
0 

0 
0 
0 
0 

... 6 

0 

" * 42 
5.978 
5.564,879 
57.448 
•575.366 

931 

3.021 
258,943 
209.930.983 
1.655.293 
14,918.135 

811 



8.422 

1.149 
25,772 
182,812.275 
203.792 
308.426 
7,093 

4,538 
74,324 
689.059.692 
427.326 
2.066.305 
9.271 

1.894 
29.898 
283.358.877 
279.477 
1.616.665 



4.960 
. 485.137 

135,033 

754 
754 
60,975,328 
7.274 
118.387 

80.857 

300 
300 
20.093.926 
2.371 
-.27.781 
66,978 

2.599 
2;599 
246.118.000 ; 
25.054 
374.561 
94.697 

3.458 
3.458 
246.474.157 
32.136 
1.791,849 



9,478 '71.277 



.-21.015 
*' ' ' '409.628 
3.360.047,574 
2.438.575 
16311.664 



8.203 



9.137 
9.137 
835.722,268 
. 82.284 
3;365.230 

91.466' 



*Other centers contributing at least 0.1% [...] 
†[...] shredded into faux reads resulting in 2.96X coverage of the genome. 



(see below). In short, we performed a true, ab 
initio whole-genome assembly in which we 
took the expedient of deriving additional 
sequence coverage, but not mate pairs, assembled 
bactigs, or genome locality, from some 
externally generated data. 
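
The shredding step described for the whole-genome assembly can be illustrated with a short sketch. It is an assumption-laden illustration rather than the authors' implementation: the `shred` function is hypothetical, and "a perfect 2X covering" is interpreted here as 550-bp reads tiled at a 275-bp offset, so that every interior base falls in exactly two faux reads.

```python
from typing import List, Tuple


def shred(bactig: str, read_len: int = 550,
          coverage: int = 2) -> List[Tuple[int, str]]:
    """Cut a bactig into overlapping faux reads.

    Returns (start, read) pairs. Reads are tiled every
    read_len // coverage bases; one extra read is anchored at the end
    of the bactig if the regular tiling would leave a tail uncovered.
    """
    if len(bactig) <= read_len:
        return [(0, bactig)]              # too short to shred
    step = read_len // coverage           # 275 bp for the defaults
    starts = list(range(0, len(bactig) - read_len + 1, step))
    if starts[-1] + read_len < len(bactig):
        starts.append(len(bactig) - read_len)
    return [(s, bactig[s:s + read_len]) for s in starts]


# Hypothetical 1650-bp bactig: five faux reads at offsets 0, 275, ..., 1100.
pairs = shred("A" * 1650)
print([start for start, _ in pairs])
```

Because the faux reads carry no mate-pair or location information, they contribute sequence coverage only, which matches the text's statement that neither bactig assembly nor BAC location was used.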

In the compartmentalized shotgun assembly 
(CSA), Celera and PFP data were partitioned 
into the largest possible chromosomal segments 
or "components" that could be determined with 
confidence, and then shotgun assembly was 
applied to each partitioned subset wherein the 
bactig data were again shredded into faux reads 
to ensure an independent ab initio assembly of 
the component. By subsetting the data in this 
way, the overall computational effort was 
reduced and the effect of interchromosomal 
duplications was ameliorated. This also resulted in a 
reconstruction of the genome that was relatively 
independent of the whole-genome assembly 
results so that the two assemblies could be 
compared for consistency. The quality of the 
partitioning into components was crucial so that 
different genome regions were not mixed 
together. We constructed components from (i) the 
longest scaffolds of the sequence from each 
BAC and (ii) assembled scaffolds of data unique 
to Celera's data set. The BAC assemblies were 
obtained by a combining assembler that used the 
bactigs and the 5X Celera data mapped to those 
bactigs as input. This effort was undertaken as 
an interim step solely because the more accurate 
and complete the scaffold for a given sequence 
stretch, the more accurately one can tile these 
scaffolds into contiguous components on the 
basis of sequence overlap and mate-pair 
information. We further visually inspected and 
curated the scaffold tiling of the components to 
further increase its accuracy. For the final CSA 
assembly, all but the partitioning was ignored, 
and an independent, ab initio reconstruction of 
the sequence in each component was obtained 
by applying our whole-genome assembly 
algorithm to the partitioned, relevant Celera data and 
the shredded, faux reads of the partitioned, 
relevant bactig data. 

2.3 Whole-genome assembly
The algorithms used for whole-genome assembly (WGA) of the human genome were
enhancements to those used to produce the sequence of the Drosophila genome reported
in detail in (25).

The WGA assembler consists of a pipeline composed of five principal stages: Screener,
Overlapper, Unitigger, Scaffolder, and Repeat Resolver, respectively. The Screener finds
and marks all microsatellite repeats with less than a 6-bp element, and screens out all
known interspersed repeat elements, including Alu, Line, and ribosomal DNA. Marked
regions get searched for overlaps, whereas screened regions do not get searched, but can
be part of an overlap that involves unscreened matching segments.

The Overlapper compares every read against every other read in search of complete
end-to-end overlaps of at least 40 bp and with no more than 6% differences in the match.
Because all data are scrupulously vector-trimmed, the Overlapper can insist on complete
overlap matches. Computing the set of all overlaps took roughly 10,000 CPU hours with a
suite of four-processor Alpha SMPs with 4 gigabytes of RAM. This took 4 to 5 days in
elapsed time with 40 such machines operating in parallel.
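
The Overlapper's acceptance rule can be illustrated with a small sketch. The Python
below is not Celera's implementation: it assumes differences are substitutions only (no
insertions or deletions), it scans candidate lengths by brute force rather than using the
seeded search a genome-scale overlapper would need, and the function name
end_to_end_overlap is hypothetical. Only the two thresholds (at least 40 bp, no more than
6% differences) come from the text.

```python
def end_to_end_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix-of-a / prefix-of-b overlap that passes
    the thresholds, or 0 if none does.

    Assumption (not from the paper): differences are substitutions only,
    so the two overlapping segments have equal length.
    """
    longest = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink toward min_len.
    for length in range(longest, min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length
    return 0
```

Because both reads are assumed to be vector-trimmed, the sketch only accepts overlaps
that run to the very end of read a and the very start of read b, which mirrors the
"complete end-to-end" requirement described above.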

Every overlap computed above is statistically a 1-in-10^17 event and thus not a
coincidental event. What makes assembly combinatorially difficult is that while many
overlaps are actually sampled from overlapping regions of the genome, and thus imply that
the sequence reads should be assembled together, even more overlaps are actually from
two distinct copies of a low-copy repeated element not screened above, thus constituting
an error if put together. We call the former "true overlaps" and the latter
"repeat-induced overlaps." The assembler must avoid choosing repeat-induced overlaps,
especially early in the process.

We achieve this objective in the Unitigger. We first find all assemblies of reads that
appear to be uncontested with respect to all other reads. We call the contigs formed from
these subassemblies unitigs (for uniquely assembled contigs). Formally, these unitigs are
the uncontested interval subgraphs of the graph of all overlaps (42). Unfortunately,
although empirically many of these assemblies are correct (and thus involve only true
overlaps), some are in fact collections of reads from several copies of a repetitive
element that have been overcollapsed into a single subassembly. However, the
overcollapsed unitigs are easily identified because their average coverage depth is too
high to be consistent with the overall level of sequence coverage. We developed a simple
statistical discriminator that gives the logarithm of the odds ratio that a unitig is
composed of unique DNA or of a repeat consisting of two or more copies. The
discriminator, set to a sufficiently stringent threshold, identifies a subset of the
unitigs that we are certain are correct. In addition, a second, less stringent threshold
identifies a subset of remaining unitigs very likely to be correctly assembled, of which
we select those that will consistently scaffold (see below), and thus are again almost
certain to be correct. We call the union of these two sets U-unitigs. Empirically, we
found from a 6X simulated shotgun of human chromosome 22 that we get U-unitigs covering
98% of the stretches of unique DNA that are >2 kbp long. We are further able to identify
the boundary of the start of a repetitive element at the ends of a U-unitig and leverage
this so that U-unitigs span more than 93% of all singly interspersed Alu elements and
other 100- to 400-bp repetitive segments.
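
The text does not give the discriminator's formula, so the following is only a sketch of
how a coverage-based log-odds score could work. It assumes read starts arrive as a
Poisson process at a uniform rate F/G (F reads over a genome of G bp); under that
assumption, a unitig spanning rho bases and containing k reads has natural-log odds of
"unique" versus "two collapsed copies" equal to rho*F/G - k*ln 2. The function name and
the example numbers are hypothetical.

```python
import math

def unique_log_odds(span_bp, n_reads, total_reads, genome_bp):
    """Natural-log odds that a unitig is unique DNA rather than a two-copy
    repeat collapsed into one subassembly.

    Assumed model (not stated in the text): read starts are Poisson with
    rate total_reads / genome_bp per base, and a collapsed two-copy repeat
    receives reads at twice that rate.
    """
    rate = total_reads / genome_bp
    return rate * span_bp - n_reads * math.log(2)

# Hypothetical example: 30 million reads over a 3 Gbp genome.
normal_depth = unique_log_odds(10_000, 100, 30_000_000, 3_000_000_000)
double_depth = unique_log_odds(10_000, 200, 30_000_000, 3_000_000_000)
```

A unitig carrying the expected number of reads scores positive (normal_depth), while one
carrying twice as many scores negative (double_depth). Two thresholds on such a score
would give the stringent and less stringent subsets described above.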

The result of running the Unitigger was thus a set of correctly assembled subcontigs
covering an estimated 73.6% of the human genome. The Scaffolder then proceeded to use
mate-pair information to link these together into scaffolds. When there are two or more
mate pairs that imply that a given pair of U-unitigs are at a certain distance and
orientation with respect to each other, the probability of this being wrong is again
roughly 1 in 10^10, assuming that mate pairs are false less than 2% of the time. Thus,
one can with high confidence link together all U-unitigs that are linked by at least two
2- or 10-kbp mate pairs, producing intermediate-sized scaffolds that are then
recursively linked together by confirming 50-kbp mate pairs and BAC end sequences. This
process yielded scaffolds that are on the order of megabase pairs in size with gaps
between their contigs that generally correspond to repetitive elements and occasionally
to small sequencing gaps. These scaffolds reconstruct the majority of the unique
sequence within a genome.

For the Drosophila assembly, we engaged in a three-stage repeat resolution strategy
where each stage was progressively more aggressive and thus more likely to make a
mistake. For the human assembly, we continued to use the first "Rocks" substage where
all unitigs with a good, but not definitive, discriminator score are placed in a scaffold
gap. This was done with the condition that two or more mate pairs with one of their
reads already in the scaffold unambiguously place the unitig in the given gap. We
estimate the probability of inserting a unitig into an incorrect gap with this strategy
to be less than 10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage of the human assembly, making it more like
the mechanism suggested in our earlier work (43). For each gap, every read R that is
placed in the gap by virtue of its mated pair M being in a contig of the scaffold and
implying R's placement is collected. Celera's mate-pairing information is correct more
than 99% of the time. Thus, almost every, but not all, of the reads in the set belong in
the gap, and when a read does not belong it rarely agrees with the remainder of the
reads. Therefore, we simply assemble this set of reads within the gap, eliminating any
reads that conflict with the assembly. This operation proved much more reliable than the
one it replaced for the Drosophila assembly; in the assembly of a simulated shotgun data
set of human chromosome 22, all stones were placed correctly.

Fig. 4. Architecture of Celera's two-pronged assembly strategy. [...] by a process. This
figure summarizes the discussion in the text that defines the terms and phrases used.
[Flow diagram; recoverable labels: Celera reads and mate pairs; Public bactigs (from
33,421 BACs); Bactigs and Celera pairs (binned by BAC); Combining Assembler; Components;
WGA Assembly; CSA Assembly.]

The final method of resolving gaps is to fill them with assembled BAC data that cover
the gap. We call this external gap "walking." We did not include the very aggressive
"Pebbles" substage described in our Drosophila work, which made enough mistakes so as to
produce repeat reconstructions for long interspersed elements whose quality was only
99.62% correct. We decided that for the human genome it was philosophically better not
to introduce a step that was certain to produce less than 99.99% accuracy. The cost was a
somewhat larger number of gaps of somewhat larger size.
At the final stage of the assembly process, and also at several intermediate points, a
consensus sequence of every contig is produced. Our algorithm is driven by the principle
of maximum parsimony, with quality-value-weighted measures for evaluating each base. The
net effect is a Bayesian estimate of the correct base to report at each position.
Consensus generation uses Celera data whenever it is present. In the event that no
Celera data cover a given region, the BAC data sequence is used.
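
As an illustration of quality-weighted base calling with the Celera-first rule, the
sketch below sums Phred-style quality values per candidate base in each alignment column
and reports the heaviest one. This is a simple weighted vote standing in for the
Bayesian estimate mentioned above, not the paper's algorithm; the tuple layout, the
source labels "celera" and "bac", and the function names are assumptions made for the
example.

```python
from collections import defaultdict

def consensus_base(column):
    """column: non-empty list of (base, quality, source) observations for
    one position, with source either "celera" or "bac" (hypothetical labels).
    """
    celera = [obs for obs in column if obs[2] == "celera"]
    # Use BAC-derived observations only when no Celera read covers the position.
    chosen = celera or column
    weight = defaultdict(float)
    for base, quality, _source in chosen:
        weight[base] += quality
    # Ties are broken alphabetically so the result is deterministic.
    return max(sorted(weight), key=weight.get)

def consensus(columns):
    """Call one base per alignment column and join them into a sequence."""
    return "".join(consensus_base(col) for col in columns)
```

For example, a column holding a high-quality BAC "A" and a low-quality Celera "C" yields
"C" under this rule, because Celera data take precedence wherever they exist.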

A key element of achieving a WGA of the human genome was to parallelize the Overlapper
and the central consensus sequence-constructing subroutines. In addition, memory was a
real issue: a straightforward application of the software we had built for Drosophila
would have required a computer with a 600-gigabyte RAM. By making the Overlapper and
Unitigger incremental, we were able to achieve the same computation with a maximum of
instantaneous usage of 28 gigabytes of RAM. Moreover, the incremental nature of the first
three stages allowed us to continually update the state of this part of the computation
as data were delivered and then perform a 7-day run to complete Scaffolding and Repeat
Resolution whenever desired. For our assembly operations, the total compute
infrastructure consists of 10 four-processor SMPs with 4 gigabytes of memory per cluster
(Compaq's ES40, Regatta) and a 16-processor NUMA machine with 64 gigabytes of memory
(Compaq's GS160, Wildfire). The total compute for a run of the assembler was roughly
20,000 CPU hours.

The assembly of Celera's data, together with the shredded bactig data, produced a set of
scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence. The chaff,
or set of reads not incorporated in the assembly, numbered 11.27 million (26%), which is
consistent with our experience for Drosophila. More than 84% of the genome was covered by
scaffolds >100 kbp long, and these averaged 91% sequence and 9% gaps with a total of
2.297 Gbp of sequence. There were a total of 93,857 gaps among the 1637 scaffolds >100
kbp. The average scaffold size was 1.5 Mbp, the average contig size was 24.06 kbp, and
the average gap size was 2.43 kbp, where the distribution of each was essentially
exponential. More than 50% of all gaps were less than 500 bp long, >62% of all gaps were
less than 1 kbp long, and no gap was >100 kbp long. Similarly, more than 65% of the
sequence is in contigs >30 kbp, more than 31% is in contigs >100 kbp, and the largest
contig was 1.22 Mbp long. Table 3 gives detailed summary statistics for the structure of
this assembly with a direct comparison to the compartmentalized shotgun assembly.
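
The averages quoted for the WGA scaffolds >100 kbp follow directly from the counts in
Table 3, which is a useful check when reading the table. The short calculation below
uses the Table 3 entries for that column; the variable names are ours, and the rounding
is simply Python's round().

```python
# WGA scaffolds >100 kbp, values as printed in Table 3.
span_bp = 2_525_334_447      # bp in scaffolds, including intrascaffold gaps
contig_bp = 2_297_678_935    # bp in contigs
n_scaffolds = 1_637
n_contigs = 95_494
n_gaps = 93_857              # equals n_contigs - n_scaffolds

avg_scaffold = span_bp / n_scaffolds          # about 1.5 Mbp
avg_contig = contig_bp / n_contigs            # about 24.06 kbp
avg_gap = (span_bp - contig_bp) / n_gaps      # about 2.43 kbp
sequence_pct = 100 * contig_bp / span_bp      # about 91% sequence, 9% gaps

print(round(avg_scaffold), round(avg_contig), round(avg_gap), round(sequence_pct))
```

The same three divisions reproduce the "Average scaffold size", "Average contig size",
and "Average intrascaffold gap size" rows of any other column in the table.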

2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we pursued a localized assembly approach that was
intended to subdivide the genome into segments, each of which could be shotgun assembled
individually. We expected that this would help in resolution of large interchromosomal
duplications and improve the statistics for calculating U-unitigs. The compartmentalized
assembly process involved clustering Celera reads and bactigs into large, multiple
megabase regions of the genome, and then running the WGA assembler on the Celera data
and shredded, faux reads obtained from the bactig data.

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

                                                                Scaffold size
                                            All        >30 kbp       >100 kbp       >500 kbp      >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds            2,905,568,203  2,748,892,430  2,700,489,905  2,489,357,260  2,248,689,128
  (including intrascaffold gaps)
No. of bp in contigs              2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                         53,591          2,845          1,935          1,060            721
No. of contigs                          170,033        112,207        107,199         93,138         82,009
No. of gaps                             116,442        109,362        105,264         92,078         81,288
No. of gaps ≤1 kbp                       72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)               54,217        966,219      1,395,602      2,348,450      3,118,848
Average contig size (bp)                 15,609         22,496         23,242         24,916         25,686
Avg. intrascaffold gap size (bp)          2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                   1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                          100             95             94             87             79

Whole-genome assembly
No. of bp in scaffolds            2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
  (including intrascaffold gaps)
No. of bp in contigs              2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                        118,968          2,507          1,637            818            554
No. of contigs                          221,036         99,189         95,494         84,641         76,285
No. of gaps                             102,068         96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp                       62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)               23,938      1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)                 11,702         23,534         24,061         25,319         25,999
Avg. intrascaffold gap size (bp)          2,560          2,487          2,426          2,213          2,082
Largest contig (bp)                   1,224,073      1,224,073      1,224,073      1,224,073      1,224,073
% of total contigs                          100             90             89             83             77

The first phase of the CSA strategy was to separate Celera reads into those that matched
the BAC contigs for a particular PFP BAC entry, and those that did not match any public
data. Such matches must be guaranteed to properly place a Celera read, so all reads were
first masked against a library of common repetitive elements, and only matches of at
least 40 bp to unmasked portions of the read constituted a hit. Of Celera's 27.27 million
reads, 20.76 million matched a bactig and another 0.62 million reads, which did not have
any matches, were nonetheless identified as belonging in the region of the bactig's BAC
because their mate matched the bactig. Of the remaining reads, 2.92 million were
completely screened out and so could not be matched, but the other 2.97 million reads had
unmasked sequence totaling 1.189 Gbp that were not found in the GenBank data set.
Because the Celera data are 5.11X redundant, we estimate that 240 Mbp of unique Celera
sequence is not in the GenBank data set.
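
The read accounting in this first CSA phase can be checked with a few lines of
arithmetic. The figures are the ones quoted in the text; the variable names are ours.
Dividing the unmatched unmasked sequence by the stated redundancy gives roughly 233 Mbp,
which is in line with the 240 Mbp estimate; the text does not say how that estimate was
rounded or adjusted, so the small difference is left unexplained here.

```python
# All read counts in millions, as quoted in the text.
total_reads = 27.27
matched_bactig = 20.76       # matched a bactig directly
placed_by_mate = 0.62        # no match, but the mate matched the bactig
fully_screened = 2.92        # completely masked, so could not be matched
unmatched_unmasked = 2.97    # unmasked sequence absent from GenBank

accounted = matched_bactig + placed_by_mate + fully_screened + unmatched_unmasked
not_in_genbank = fully_screened + unmatched_unmasked

unmasked_bp = 1.189e9        # unmasked sequence in the 2.97 million reads
redundancy = 5.11            # Celera coverage
unique_estimate_mbp = unmasked_bp / redundancy / 1e6

print(round(accounted, 2), round(not_in_genbank, 2), round(unique_estimate_mbp, 1))
```

The four categories sum to the full 27.27 million reads, and the two unmatched categories
together give 5.89 million reads that did not match the GenBank data.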

In the next step of the CSA process, a combining assembler took the relevant 5X Celera
reads and bactigs for a BAC entry, and produced an assembly of the combined data for that
locale. These high-quality sequence reconstructions were a transient result whose
utility was simply to provide more reliable information for the purposes of their tiling
into sets of overlapping and adjacent scaffold sequences in the next step. In outline,
the combining assembler first examines the set of matching Celera reads to determine if
there are excessive pileups indicative of unscreened repetitive elements. Wherever these
occur, reads in the repeat region whose mates have not been mapped to consistent
positions are removed. Then all sets of mate pairs that consistently imply the same
relative position of two bactigs are bundled into a link and weighted according to the
number of mates in the bundle. A "greedy" strategy then attempts to order the bactigs by
selecting bundles of mate pairs in order of their weight. A selected mate-pair bundle can
tie together two formative scaffolds. It is incorporated to form a single scaffold only
if it is consistent with the majority of links between contigs of the scaffold. Once
scaffolding is complete, gaps are filled by the "Stones" strategy described above for the
WGA assembler.
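
The bundling and greedy selection steps can be sketched as follows. This is a simplified
illustration, not the combining assembler itself: relative placement is reduced to a
symmetric orientation label ("same" or "opposite"), the majority-consistency test is
replaced by a cruder rule that rejects a bundle when a heavier bundle has already fixed
the same pair of bactigs, and all names are hypothetical.

```python
from collections import Counter

def greedy_scaffold(mate_links):
    """mate_links: iterable of (bactig_a, bactig_b, orientation), one entry
    per mate pair. Returns (accepted_bundles, scaffold_of)."""
    # Bundle mate pairs implying the same placement; weight = number of mates.
    bundles = Counter()
    for a, b, orient in mate_links:
        bundles[(min(a, b), max(a, b), orient)] += 1

    parent = {}

    def find(x):
        # Union-find lookup with path halving; unseen bactigs start alone.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted, fixed_pairs = [], set()
    # Heaviest bundles first; ties broken by name so the order is deterministic.
    for (a, b, orient), weight in sorted(bundles.items(),
                                         key=lambda kv: (-kv[1], kv[0])):
        if (a, b) in fixed_pairs:
            continue  # conflicts with a heavier bundle for the same pair
        fixed_pairs.add((a, b))
        parent[find(a)] = find(b)  # tie the two formative scaffolds together
        accepted.append((a, b, orient, weight))
    return accepted, {x: find(x) for x in parent}
```

Processing bundles from heaviest to lightest means that well-supported joins are made
first, so a weakly supported, contradictory bundle can only be rejected, never used to
displace a stronger one.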

The GenBank data for the Phase 1 and 2 BACs consisted of an average of 19.8 bactigs per
BAC of average size 8099 bp. Application of the combining assembler resulted in
individual Celera BAC assemblies being put together into an average of 1.83 scaffolds
(median of 1 scaffold) consisting of an average of 8.57 contigs of average size 18,973
bp. In addition to defining order and orientation of the sequence fragments, there were
57% fewer gaps in the combined result. For Phase 0 data, the average GenBank entry
consisted of 91.52 reads of average length 784 bp. Application of the combining
assembler resulted in an average of 54.8 scaffolds consisting of an average of 58.1
contigs of average size 873 bp. Basically, some small amount of assembly took place, but
not enough Celera data were matched to truly assemble the 0.5X to 1X data set
represented by the typical Phase 0 BACs. The combining assembler was also applied to the
Phase 3 BACs for SNP identification, confirmation of assembly, and localization of the
Celera reads. The Phase 0 data suggest that a combined whole-genome shotgun data set and
1X light-shotgun of BACs will not yield good assembly of BAC regions; at least 3X
light-shotgun of each BAC is needed.

The 5.89 million Celera fragments not matching the GenBank data were assembled with our
whole-genome assembler. The assembly resulted in a set of scaffolds totaling 442 Mbp in
span and consisting of 326 Mbp of sequence. More than 20% of the scaffolds were >5 kbp
long, and these averaged 63% sequence and 27% gaps with a total of 302 Mbp of sequence.
All scaffolds >5 kbp were forwarded along with all scaffolds produced by the combining
assembler to the subsequent tiling phase.

At this stage, we typically had one or two scaffolds for every BAC region constituting
at least 95% of the relevant sequence, and a collection of disjoint Celera-unique
scaffolds. The next step in developing the genome components was to determine the order
and overlap tiling of these BAC and Celera-unique scaffolds across the genome. For this,
we used Celera's 50-kbp mate-pairs information, and BAC-end pairs (18) and sequence
tagged site (STS) markers (44) to provide long-range guidance and chromosome separation.
Given the relatively manageable number of scaffolds, we chose not to produce this tiling
in a fully automated manner, but to compute an initial tiling with a good heuristic and
then use human curators to resolve discrepancies or missed join opportunities. To this
end, we developed a graphical user interface that displayed the graph of tiling overlaps
and the evidence for each. A human curator could then explore the implication of mapped
STS data, dot-plots of sequence overlap, and a visual display of the mate-pair evidence
supporting a given choice. The result of this process was a collection of "components,"
where each component was a tiled set of BAC and Celera-unique scaffolds that had been
curator-approved. The process resulted in 3845 components with an estimated span of
2.922 Gbp.

In order to generate the final CSA, we assembled each component with the WGA algorithm.
As was done in the WGA process, the bactig data were shredded into a synthetic 2X
shotgun data set in order to give the assembler the freedom to independently assemble
the data. By using faux reads rather than bactigs, the assembly algorithm could correct
errors in the assembly of bactigs and remove chimeric content in a PFP data entry.
Chimeric or contaminating sequence (from another part of the genome) would not be
incorporated into the reassembly of the component because it did not belong there. In
effect, the previous steps in the CSA process served only to bring together Celera
fragments and PFP data relevant to a large contiguous segment of the genome, wherein we
applied the assembler used for WGA to produce an ab initio assembly of the region.
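
A minimal sketch of the shredding step, assuming evenly spaced fixed-length faux reads.
The default read length of 550 bp and the function name are assumptions made for the
example, since this section states only the target coverage (2X); the real procedure may
have chosen read positions differently.

```python
def shred(bactig, read_len=550, coverage=2.0):
    """Cut a bactig into overlapping faux reads at roughly the requested
    coverage. read_len=550 is a hypothetical default, not a value taken
    from this section."""
    if len(bactig) <= read_len:
        return [bactig]
    # Starting a new read every read_len / coverage bases gives about
    # `coverage`-fold depth away from the ends of the bactig.
    step = max(1, int(read_len / coverage))
    last_start = len(bactig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the bactig is covered
    return [bactig[s:s + read_len] for s in starts]
```

Because the faux reads carry no record of the bactig they came from, an assembler fed
with them is free to split a chimeric bactig or correct a misjoined one, which is the
property the text relies on.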
WGA assembly of the components resulted in a set of scaffolds totaling 2.906 Gbp in span
and consisting of 2.654 Gbp of sequence. The chaff, or set of reads not incorporated
into the assembly, numbered [illegible] million, or 22%. More than 90.0% of the genome
was covered by scaffolds spanning >100 kbp long, and these averaged 92.2% sequence and
7.8% gaps with a total of 2.492 Gbp of sequence. There were a total of 105,264 gaps among
the 107,199 contigs that belong to the 1940 scaffolds spanning >100 kbp. The average
scaffold size was 1.4 Mbp, the average contig size was 23.24 kbp, and the average gap
size was 2.0 kbp, where each distribution of sizes was exponential. As such, averages
tend to be underrepresentative of the majority of the data. Figure 5 shows a histogram of
the bases in scaffolds of various size ranges. Consider also that more than 49% of all
gaps were <500 bp long, more than 62% of all gaps were <1 kbp, and all gaps are <100 kbp
long. Similarly, more than 73% of the sequence is in contigs >30 kbp, more than 49% is in
contigs >100 kbp, and the largest contig was 1.99 Mbp long. Table 3 provides summary
statistics for the structure of this assembly with a direct comparison to the WGA
assembly.



2.5 Comparison of the WGA and CSA scaffolds

Having obtained two assemblies of the human genome via independent computational
processes (WGA and CSA), we compared scaffolds from the two assemblies as another means
of investigating their completeness, consistency, and contiguity. From each assembly, a
set of reference scaffolds containing at least 1000 fragments (Celera sequencing reads
or bactig shreds) was obtained; this amounted to 2218 WGA scaffolds and 1717 CSA
scaffolds, for a total of 2.087 Gbp and 2.474 Gbp. The sequence of each reference
scaffold was compared to the sequence of all scaffolds from the other assembly with which
it shared at least 20 fragments or at least 20% of the fragments of the smaller scaffold.
For each such comparison, all matches of at least 200 bp with at most 2% mismatch were
tabulated.

From this tabulation, we estimated the amount of unique sequence in each assembly in two
ways. The first was to determine the number of bases of each assembly that were not
covered by a matching segment in the other assembly. Some 82.5 Mbp of the WGA (3.95%)
was not covered by the CSA, whereas 204.5 Mbp (8.26%) of the CSA was not covered by the
WGA. This estimate did not require any consistency of the assemblies or any uniqueness
of the matching segments. Thus, another analysis was conducted in which matches of less
than 1 kbp between a pair of scaffolds were excluded unless they were confirmed by other
matches having a consistent order and orientation. This gives some measure of consistent
coverage: 1.982 Gbp (95.00%) of the WGA is covered by the CSA, and 2.169 Gbp (87.69%) of
the CSA is covered by the WGA by this more stringent measure.
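
The first, less stringent estimate amounts to counting the bases of a scaffold that lie
outside the union of its match intervals. The sketch below shows that calculation for a
single scaffold; the interval representation (half-open start and end coordinates) and
the function name are assumptions made for the example.

```python
def uncovered_bases(scaffold_len, matches):
    """matches: list of (start, end) half-open intervals on the scaffold that
    are covered by a matching segment from the other assembly. Overlapping
    matches are merged so that no base is counted twice."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(matches):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return scaffold_len - covered
```

Summing this quantity over all reference scaffolds of one assembly and dividing by their
total length gives a percentage of the kind reported above. The more stringent measure
would additionally drop matches shorter than 1 kbp that lack confirming neighbors before
the intervals are merged.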

The comparison of WGA to CSA also 
permitted evaluation of scaffolds for structur- 
al inconsistencies. We looked for instances in 
which a large section of a scaffold from one 
assembly matched only one scaffold from the 
other assembly, but failed to match over the 
full length of the overlap implied by the 
matching segments. An initial set of candidates was identified automatically, and then
each candidate was inspected by hand. From this process, we identified 31 instances in
which the assemblies appear to disagree in a nonlocal fashion. These cases are being
further evaluated to determine which assembly is in error and why.

In addition, we evaluated local inconsistencies of order or orientation. The following
results exclude cases in which one contig in one assembly corresponds to more than one
overlapping contig in the other assembly (as long as the order and orientation of the
latter agrees with the positions they match in the former). Most of these small
rearrangements involved segments on the order of hundreds of base pairs and rarely >1
kbp. We found a total of 295 kbp (0.012%) in the CSA assemblies that were locally
inconsistent with the WGA assemblies, whereas 2.108 Mbp (0.11%) in the WGA assembly were
inconsistent with the CSA assembly.
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The CSA assembly was a few percentage 
points better in terms of coverage and slightly 
more consistent than the WGA, because it 
was in effect performing a few thousand shot- 
gun assemblies of megabase-sized problems, 
whereas the WGA is performing a shotgun 
assembly of a gigabase-sized problem. When 
one considers the increase of two-and-a-half orders of magnitude in problem size, the
information loss between the two is remarkably small. Because CSA was logistically
easier to deliver and the better of the two results available at the time when
downstream analyses needed to be begun, all subsequent analysis was performed on this
assembly.

2.6 Mapping scaffolds to the genome 
The final step in assembling the genome was to order and orient the scaffolds on the
chromosomes. We first grouped scaffolds together on the basis of their order in the
components from CSA. These grouped scaffolds were reordered by examining residual
mate-pairing data between the scaffolds. We next mapped the scaffold groups onto the
chromosome using physical mapping data. This step depends on having reliable
high-resolution map information such that each scaffold will overlap multiple markers.
There are two genome-wide types of map information available: high-density STS maps and
fingerprint maps of BAC clones developed at Washington University (45). Among the
genome-wide STS maps, GeneMap99 (GM99) has the most markers and therefore was most
useful for mapping scaffolds. The two different mapping approaches are complementary to
one another. The fingerprint maps should have better local order because they were built
by comparison of overlapping BAC clones. On the other hand, GM99 should have a more
reliable long-range order, because the framework markers were derived from
well-validated genetic maps. Both types of maps were used as a reference for human
curation of the components that were the input to the regional assembly, but they did not
determine the order of sequences produced by the assembler.

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the
percent of total sequence is indicated. [Histogram; x-axis: scaffold size, with bins
<30 kb, 30-50 kb, 50-100 kb, 100-500 kb, 0.5-1 Mb, 1-5 Mb, 5-10 Mb, and >10 Mb.]



In order to determine the effectiveness of the fingerprint maps and GM99 for mapping
scaffolds, we first examined the reliability of these maps by comparison with large
scaffolds. Only 1% of the STS markers on the 10 largest scaffolds (those >9 Mbp) were
mapped on a different chromosome on GM99. Two percent of the STS markers disagreed in
position by more than five framework bins. However, for the fingerprint maps, a 2%
chromosome discrepancy was observed, and on average 23.8% of BAC locations in the
scaffold sequence disagreed with fingerprint map placement by more than five BACs. When
further examining the source of discrepancy, it was found that most of the discrepancy
came from 4 of the 10 scaffolds, indicating that there is variation in the quality of
either the map or the scaffolds. All four scaffolds were assembled as well as the other
six, as judged by clone coverage analysis, and showed the same low discrepancy rate to
GM99, and thus we concluded that the fingerprint map global order in these cases was not
reliable. Smaller scaffolds had a higher discordance rate with GM99 (4.21% of STSs were
discordant by more than five framework bins), but a lower discordance rate with the
fingerprint maps (11% of BACs disagreed with fingerprint maps by more than five BACs).
This observation agrees with the clone coverage analysis (46) that Celera scaffold
construction was better supported by long-range mate pairs in larger scaffolds than in
small scaffolds.

We created two orderings of Celera scaffolds on the basis of the markers (BAC or STS) on
these maps. Where the order of scaffolds agreed between GM99 and the WashU BAC map, we
had a high degree of confidence that that order was correct; these scaffolds were termed
"anchor scaffolds." Only scaffolds with a low overall discrepancy rate with both maps
were considered anchor scaffolds. Scaffolds in GM99 bins were allowed to permute in their
order to match WashU ordering, provided they did not violate their framework orders.
Orientation of individual scaffolds was determined by the presence of multiple mapped
markers with consistent order. Scaffolds with only one marker have insufficient
information to assign orientation. We found 70.1% of the genome in anchored scaffolds,
more than 99% of which are also oriented (Table 4). Because GM99 is of lower resolution
than the WashU map, a number of scaffolds without STS matches could be ordered relative
to the anchored scaffolds because they included sequence from the same or adjacent BACs
on the WashU map. On the other hand, because of occasional WashU global ordering
discrepancies, a number of scaffolds determined to be "unmappable" on the WashU map
could be ordered relative to the anchored scaffolds with GM99. These scaffolds were
termed "ordered scaffolds." We found that 13.9% of the assembly could be ordered by these
additional methods, and thus 84.0% of the genome was ordered unambiguously.

Next, all scaffolds that could be placed, but not ordered, between anchors were assigned
to the interval between the anchored scaffolds and were deemed to be "bounded" between
them. For example, small scaffolds having STS hits from the same GeneMap bin or hitting
the same BAC cannot be ordered relative to each other, but can be assigned a placement
boundary relative to other anchored or ordered scaffolds. The remaining scaffolds either
had no localization information, conflicting information, or could only be assigned to a
generic chromosome location. Using the above approaches, ~98% of the genome was
anchored, ordered, or bounded.

Finally, we assigned a location for each scaffold placed on the
chromosome by spreading out the scaffolds per chromosome. We assumed
that the remaining unmapped scaffolds, constituting 2% of the genome,
were distributed evenly across the genome. By dividing the sum of
unmapped scaffold lengths by the number of mapped scaffolds, we arrived
at an estimate of interscaffold gap of 1483 bp. This gap was used to
separate all the scaffolds on each chromosome and to assign an offset
in the chromosome.
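
The offset assignment described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy scaffold lengths, and the choice to start the first scaffold at offset 0 are all hypothetical.

```python
def estimate_gap(unmapped_lengths, n_mapped):
    """Interscaffold gap estimate: total unmapped length spread over the mapped scaffolds."""
    return sum(unmapped_lengths) // n_mapped


def assign_offsets(mapped_lengths, gap):
    """Place ordered scaffolds along one chromosome, separated by a fixed gap."""
    offsets = []
    position = 0  # assumption: the first scaffold starts at offset 0
    for length in mapped_lengths:
        offsets.append(position)
        position += length + gap
    return offsets


# Toy numbers (hypothetical), chosen only to make the arithmetic easy to follow.
gap = estimate_gap(unmapped_lengths=[500, 1000], n_mapped=3)
offsets = assign_offsets(mapped_lengths=[10_000, 20_000, 5_000], gap=gap)
print(gap, offsets)
```

With the toy inputs, each scaffold begins one gap width after the end of the previous one, which is the only property the text relies on.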

During the scaffold-mapping effort, we encountered many problems that
resulted in additional quality assessment and validation analysis. At
least 978 (3% of 33,173) BACs were believed to have sequence data from
more than one location in the genome (47). This is consistent with the
bactig chimerism analysis reported above in the Assembly Strategies
section. These BACs could not be assigned to unique positions within
the CSA assembly and thus could not be used for ordering scaffolds.
Likewise, it was not always possible to assign STSs to unique locations
in the assembly because of genome duplications, repetitive elements,
and pseudogenes.

Because of the time required for an exhaustive search for a perfect
overlap, CSA generated 21,607 intrascaffold gaps where the mate-pair
data suggested that the contigs should overlap, but no overlap was
found. These gaps were defined as a fixed 50 bp in length and make up
18.6% of the total 116,442 gaps in the CSA assembly.

We chose not to use the order of exons implied in cDNA or EST data as a
way of ordering scaffolds. The rationale for not using these data was
that doing so would have biased certain regions of the assembly by
rearranging scaffolds to fit the transcript data and made validation of
both the assembly and gene definition processes more difficult.



THE HUMAN GENOME 
2.7 Assembly and validation analysis
We analyzed the assembly of the genome from the perspectives of
completeness (amount of coverage of the genome) and correctness (the
structural accuracy of the order and orientation and the consensus
sequence of the assembly).

Completeness. Completeness is defined as the percentage of the
euchromatic sequence represented in the assembly. This cannot be known
with absolute certainty until the euchromatin sequence has been
completed. However, it is possible to estimate completeness on the
basis of (i) the estimated sizes of intrascaffold gaps; (ii) coverage
of the two published chromosomes, 21 and 22 (48, 49); and (iii)
analysis of the percentage of an independent set of random sequences
(STS markers) contained in the assembly. The whole-genome libraries
contain heterochromatic sequence and, although no attempt has been made
to assemble it, there may be instances of unique sequence embedded in
regions of heterochromatin as were observed in Drosophila (50, 51).

The sequences of human chromosomes 21 and 22 have been completed to
high quality and published (48, 49). Although this sequence served as
input to the assembler, the finished sequence was shredded into a
shotgun data set so that the assembler had the opportunity to assemble
it differently from the original sequence in the case of structural
polymorphisms or assembly errors in the BAC data. In particular, the
assembler must be able to resolve repetitive elements at the scale of
components (generally multimegabase in size), and so this comparison
reveals the level to which the assembler resolves repeats. In certain
areas, the assembly structure differs from the published versions of
chromosomes 21 and 22 (see below). The flexibility to assemble
"finished" sequence differently on the basis of Celera data resulted in
an assembly with more segments than the chromosome 21 and 22 sequences.
We examined the reasons why there are more gaps in the Celera sequence
than in chromosomes 21 and 22 and expect that they may be typical of
gaps in other regions of the genome. In the Celera assembly, there are
25 scaffolds, each containing at least 10 kb of sequence, that
collectively span 94.3% of chromosome 21. Sixty-two scaffolds span
95.7% of chromosome 22. The total length of the gaps remaining in the
Celera assembly for these two chromosomes is 3.4 Mbp. These gap
sequences were analyzed by RepeatMasker and by searching against the
entire genome assembly (52). About 50% of the gap sequence consisted of
common repetitive elements identified by RepeatMasker; more than half
of the remainder was lower copy number repeat elements.
A more global way of assessing completeness is to measure the content
of an independent set of sequence data in the assembly. We compared
48,938 STS markers from Genemap99 (51) to the scaffolds. Because these
markers were not used in the assembly processes, they provided a truly
independent measure of completeness. ePCR (53) and BLAST (54) were used
to locate STSs on the assembled genome. We found 44,524 (91%) of the
STSs in the mapped genome. An additional 2648 markers (5.4%) were found
by searching the unassembled data or "chaff." We identified 1283 STS
markers (2.6%) not found in either Celera sequence or BAC data as of
September 2000, raising the possibility that these markers may not be
of human origin. If that were the case, the Celera assembled sequence
would represent 93.4% of the human genome and the unassembled data
5.5%, for a total of 98.9% coverage. Similarly, we compared CSA against
36,678 TNG radiation hybrid markers (55a) using the same method. We
found that 32,371 markers (88%) were located in the mapped CSA
scaffolds, with 2055 markers (5.6%) found in the remainder. This gave a
94% coverage of the genome through another genome-wide survey.
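
The marker-based coverage figures can be rechecked from the counts quoted above. The short script below is only a worked restatement of that arithmetic; the variable names are illustrative, and the recomputed percentages agree with the reported ones to within about 0.1 percentage point (the reported values appear to be rounded or truncated).

```python
def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# GeneMap99 STS counts quoted in the text.
total_sts = 48_938
in_mapped = 44_524
in_chaff = 2_648
not_found = 1_283

# If the markers found nowhere are not human, drop them from the denominator.
human_sts = total_sts - not_found
assembled_pct = pct(in_mapped, human_sts)
unassembled_pct = pct(in_chaff, human_sts)
total_pct = assembled_pct + unassembled_pct

# TNG radiation hybrid markers: mapped scaffolds plus remainder, no exclusion step.
tng_pct = pct(32_371 + 2_055, 36_678)

print(f"{assembled_pct:.2f} {unassembled_pct:.2f} {total_pct:.2f} {tng_pct:.2f}")
```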

Table 4. Summary of scaffold mapping. Scaffolds were mapped to the
genome with different levels of confidence (anchored scaffolds have the
highest confidence; unmapped scaffolds have the lowest). Anchored
scaffolds were consistently ordered by the WashU BAC map and GM99.
Ordered scaffolds were consistently ordered by at least one of the
following: the WashU BAC map, GM99, or component tiling path. Bounded
scaffolds had order conflicts between at least two of the external
maps, but their placements were adjacent to a neighboring anchored or
ordered scaffold. Unmapped scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given below each category.

Oriented
38,241   368,753,463   14
274,536,424   10
Unoriented

Correctness. Correctness is defined as the structural and sequence
accuracy of the assembly. Because the source sequences for the Celera
data and the GenBank data are from different individuals, we could not
directly compare the consensus sequence of the assembly against other
finished sequence for determining sequencing accuracy at the nucleotide
level, although this has been done for identifying polymorphisms as
described in Section 6. The accuracy of the consensus sequence is at
least 99.96% on the basis of a statistical estimate derived from the
quality values of the underlying reads.

The structural consistency of the assembly can be measured by mate-pair
analysis. In a correct assembly, every mated pair of sequencing reads
should be located on the consensus sequence with the correct separation
and orientation between the pairs. A pair is termed "valid" when the
reads are in the correct orientation and the distance between them is
within the mean ± 3 standard deviations of the distribution of insert
sizes of the library from which the pair was sampled. A pair is termed
"misoriented" when the reads are not correctly oriented, and is termed
"misseparated" when the distance between the reads is not in the
correct range but the reads are correctly oriented. The mean ± the
standard deviation of each library used by the assembler was determined
as described above. To validate these, we examined all reads mapped to
the finished sequence of chromosome 21 (48) and determined how many
incorrect mate pairs there were as a result of laboratory tracking
errors and chimerism (two different segments of the genome cloned into
the same plasmid), and how tight the distribution of insert sizes was
for those that were correct (Table 5). The standard deviations for all
Celera libraries were quite small, less than 15% of the insert length,
with the exception of a few 50-kbp libraries. The 2- and 10-kbp
libraries contained less than 2% invalid mate pairs, whereas the 50-kbp
libraries were somewhat higher (~10%). Thus, although the mate-pair
information was not perfect, its accuracy was such that measuring
valid, misoriented, and misseparated pairs with respect to a given
assembly was deemed to be a reliable instrument for validation
purposes, especially when several mate pairs confirm or deny an
ordering.
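
The three mate-pair categories defined above translate directly into a small decision rule. The sketch below is illustrative only: the function name, its arguments, and the example library (10-kbp mean, 1-kbp standard deviation) are assumptions made for the example, not values or code from the study.

```python
def classify_mate_pair(correctly_oriented, distance, lib_mean, lib_sd, n_sd=3.0):
    """Classify a mate pair as 'valid', 'misoriented', or 'misseparated'.

    A pair is valid when the orientation is correct and the separation lies
    within lib_mean +/- n_sd * lib_sd for the library it was sampled from.
    """
    if not correctly_oriented:
        return "misoriented"
    if abs(distance - lib_mean) <= n_sd * lib_sd:
        return "valid"
    return "misseparated"


# Hypothetical 10-kbp library with a 1-kbp standard deviation.
examples = [
    classify_mate_pair(True, 10_500, 10_000, 1_000),
    classify_mate_pair(True, 14_000, 10_000, 1_000),
    classify_mate_pair(False, 10_000, 10_000, 1_000),
]
print(examples)
```

Orientation is tested first so that a pair with both problems is reported as misoriented, which matches the definition of "misseparated" as applying only to correctly oriented reads.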
The clone coverage of the genome was 39X, meaning that any given base
pair was, on average, contained in 39 clones or, equivalently, spanned
by 39 mate-paired reads. Areas of low clone coverage or areas with a
high proportion of invalid mate pairs would indicate potential assembly
problems. We computed the coverage of each base in the assembly by
valid mate pairs (Table 6). In summary, for scaffolds >30 kbp in
length, less than 1% of the Celera assembly was in regions of less than
3X clone coverage. Thus, more than 99% of the assembly, including order
and orientation, is strongly supported by this measure alone.
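
One straightforward way to compute per-base coverage by valid mate pairs is a difference-array sweep, sketched below. The text does not say how the coverage was actually computed, so the approach, the function names, and the half-open interval convention are assumptions for illustration.

```python
def clone_coverage(length, spans):
    """Per-base count of valid mate-pair spans; spans are half-open [start, end)."""
    diff = [0] * (length + 1)
    for start, end in spans:
        diff[start] += 1
        diff[end] -= 1
    coverage, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


def fraction_below(coverage, threshold=3):
    """Fraction of bases covered by fewer than `threshold` valid mate pairs."""
    return sum(1 for c in coverage if c < threshold) / len(coverage)


# Toy 10-base scaffold with four valid mate-pair spans (hypothetical).
cov = clone_coverage(10, [(0, 6), (2, 8), (4, 10), (4, 6)])
low = fraction_below(cov, threshold=3)
print(cov, low)
```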
We examined the locations and number of all misoriented and
misseparated mates. In addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also performed a study of the PFP
assembly as of 5 September 2000 (30, 55b). In this latter case, Celera
mate pairs had to be mapped to the PFP assembly. To avoid mapping
errors due to high-fidelity repeats, the only pairs mapped were those
for which both reads matched at only one location with less than 6%
differences. A threshold was set such that sets of five or more
simultaneously invalid mate pairs indicated a potential breakpoint,
where the construction of the two assemblies differed. The graphic
comparison of the CSA chromosome 21 assembly with the published
sequence (Fig. 6A) serves as a validation of this methodology. Blue
tick marks in the panels indicate breakpoints. There were a similar
(small) number of breakpoints on both chromosome sequences. The
exception was 12 sets of scaffolds in the Celera assembly (a total of
3% of the chromosome length in 212 single-contig scaffolds) that were
mapped to the wrong positions because they were too small to be mapped
reliably. Figures 6 and 7 and Table 6 illustrate the mate-pair
differences and breakpoints between the two assemblies. There was a
higher percentage of misoriented and misseparated mate pairs in the
large-insert libraries (50 kbp and BAC ends) than in the small-insert
libraries in both assemblies (Table 6). The large-insert libraries are
more likely to identify discrepancies simply because they span a larger
segment of the genome. The graphic comparison between the two
assemblies for chromosome 8 (Fig. 6, B and C) shows that there are many



Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative
orientation or placement, they were considered invalid (number of
invalid mate pairs).
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gene boundaries. During this process, multiple 
hits to the same region were collapsed to a coherent set of data by
tracking the coverage of a region. For example, if a group of bases was
represented by multiple overlapping ESTs, the union of these regions
matched by the set of ESTs on the scaffold was marked as being
supported by EST evidence. This resulted in a series of "gene bins,"
each of which was believed to contain a single gene. One weakness of
this initial implementation of the algorithm was in predicting gene
boundaries in regions of tandemly duplicated genes. Gene clusters
frequently resulted in homologous neighboring genes being joined
together, resulting in an annotation that artificially concatenated
these gene models.
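
Collapsing multiple overlapping hits into the union of the regions they cover is a standard interval-merging step. The sketch below shows one common way to do it; it is not the Otto code, and the function name and coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Return the union of half-open (start, end) intervals as sorted, disjoint spans."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]


# Three hypothetical EST matches on one scaffold; the first two overlap.
est_hits = [(150, 300), (100, 200), (400, 450)]
supported = merge_intervals(est_hits)
print(supported)
```

The sketch also illustrates the weakness noted above: if hits from two tandemly duplicated genes overlap or abut, their regions merge into a single span.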

Next, known genes (those with exact matches of a full-length cDNA
sequence to the genome) were identified, and the region corresponding
to the cDNA was annotated as a predicted transcript. A subset of the
curated human gene set RefSeq from the National Center for
Biotechnology Information (NCBI) was included as a data set searched in
the computational pipeline. If a RefSeq transcript matched the genome
assembly for at least 50% of its length at >92% identity, then the SIM4
(63) alignment of the RefSeq transcript to the region of the genome
under analysis was promoted to the status of an Otto annotation.
Because the genome sequence has gaps and sequence errors such as
frameshifts, it was not always possible to predict a transcript that
agrees precisely with the experimentally determined cDNA sequence. A
total of 6538 genes in our inventory were identified and transcripts
predicted in this way.
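
The promotion rule for RefSeq alignments amounts to a two-part threshold test. The snippet below restates it as a predicate; the function name and the way alignment length and identity are passed in are assumptions, and a real pipeline would derive both values from the alignment itself.

```python
MIN_FRACTION_ALIGNED = 0.50   # at least 50% of the transcript length
MIN_IDENTITY = 0.92           # strictly greater than 92% identity


def promote_refseq_alignment(aligned_len, transcript_len, identity):
    """True if an alignment meets the coverage and identity thresholds in the text."""
    fraction_aligned = aligned_len / transcript_len
    return fraction_aligned >= MIN_FRACTION_ALIGNED and identity > MIN_IDENTITY


# Hypothetical alignments of a 2000-base transcript.
print(promote_refseq_alignment(1500, 2000, 0.985))
print(promote_refseq_alignment(900, 2000, 0.990))
print(promote_refseq_alignment(1500, 2000, 0.920))
```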

Regions that have a substantial amount of sequence similarity, but do
not match known genes, were analyzed by that part of the Otto system
that uses the sequence similarity information to predict a transcript.
Here, Otto
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Fig. 6. Comparison of the CSA and the PFP assembly. 
(A) All of chromosome 21, (B) all of chromosome 8, and (C) a 1-Mb
region of chromosome 8 representing a single Celera scaffold. To
generate the figure, Celera fragment sequences were mapped onto each
assembly. The PFP assembly is indicated in the upper third of each
panel; the Celera assembly is indicated in the lower third. In the
center of the panel, green lines show Celera sequences that are in the
same order and orientation in both assemblies and form the longest
consistently ordered run of sequences. Yellow lines indicate sequence
blocks that are in the same orientation, but out of order. Red lines
indicate sequence blocks that are not in the same orientation. For
clarity, in the latter two cases, lines are only drawn between segments
of matching sequence that are at least 50 kbp long. The top and bottom
thirds of each panel show the extent of Celera mate-pair violations
(red, misoriented; yellow, incorrect distance between the mates) for
each assembly grouped by library size. (Mate pairs that are within the
correct distance, as expected from the mean library insert size, are
omitted from the figure for clarity.) Predicted breakpoints,
corresponding to stacks of violated mate pairs of the same type, are
shown as blue ticks on each assembly axis. Runs of more than 10,000 Ns
are shown as cyan bars. Plots of all 24 chromosomes can be seen in Web
fig. 3 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.

evaluates evidence generated by the computational pipeline,
corresponding to conservation between mouse and human genomic DNA,
similarity to human transcripts (ESTs and cDNAs), similarity to rodent
transcripts (ESTs and cDNAs), and similarity of the translation of
human genomic DNA to known proteins to predict potential genes in the
human genome. The sequence from the region of genomic DNA contained in
a gene bin was extracted, and the subsequences supported by any
homology evidence were marked (plus 100




Fig. 7. Schematic view of the distribution of breakpoints and large gaps
on all chromosomes. For each chromosome, the upper pair of lines
represent the PFP assembly, and the lower pair of lines represent Celera's
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represent a gap of larger than 10,000 bp. The number of breakpoints per 
chromosome is indicated in black, and the chromosome numbers in red.
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bases flanking these regions). The other bases 
in the region, those not covered by any homology evidence, were
replaced by N's. This sequence segment, with high confidence regions
represented by the consensus genomic sequence and the remainder
represented by N's, was then evaluated by Genscan to see if a
consistent gene model could be generated. This procedure simplified the
gene-prediction task by first establishing the boundary for the gene
(not a strength of most gene-finding algorithms), and by eliminating
regions with no supporting evidence. If Genscan returned a plausible
gene model, it was further evaluated before being promoted to an "Otto"
annotation. The final Genscan predictions were often quite different
from the prediction that Genscan returned on the same region of native
genomic sequence. A weakness of using Genscan to refine the gene model
is the loss of valid, small exons from the final annotation.
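
The masking step that prepares a gene bin for Genscan can be illustrated as follows. This is a minimal sketch, not the Otto implementation: the function name, the half-open coordinates, and the toy sequence are assumptions, and the flank width is left as a parameter.

```python
def mask_unsupported(sequence, supported, flank):
    """Replace every base outside the supported regions (plus flanks) with 'N'.

    `supported` holds half-open (start, end) intervals backed by homology evidence.
    """
    keep = [False] * len(sequence)
    for start, end in supported:
        lo = max(0, start - flank)
        hi = min(len(sequence), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if kept else "N" for base, kept in zip(sequence, keep))


# Toy example: one supported region at positions 4-5, with a 1-base flank.
masked = mask_unsupported("ACGTACGTAC", [(4, 6)], flank=1)
print(masked)
```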

The next step in defining gene structures based on sequence similarity
was to compare each predicted transcript with the homology-based
evidence that was used in previous steps to evaluate the depth of
evidence for each exon in the prediction. Internal exons were
considered to be supported if they were covered by homology evidence to
within ±10 bases of their edges. For first and last exons, the internal
edge was required to be within 10 bases, but the external edge was
allowed greater latitude to allow for 5' and 3' untranslated regions
(UTRs). To be retained, a prediction for a multi-exon gene must have
evidence such that the total number of "hits," as defined above,
divided by the number of exons in the prediction must be >0.66 or must
correspond to a RefSeq sequence. A single-exon gene must be covered by
at least three supporting hits (±10 bases on each side), and these must
cover the complete predicted open reading frame. For a single-exon
gene, we also required that the Genscan prediction include both a start
and a stop codon. Gene models that did not meet these criteria were
disregarded, and

Table 7. Sensitivity and specificity of Otto and Genscan. Sensitivity
and specificity were calculated by first aligning the prediction to the
published RefSeq transcript, tallying the number (N) of uniquely
aligned RefSeq bases. Sensitivity is the ratio of N to the length of
the published RefSeq transcript. Specificity is the ratio of N to the
length of the prediction. All differences are significant (Tukey HSD;
P < 0.001).

Method                 Sensitivity   Specificity
Otto (RefSeq only)*    0.939         0.973
Otto (homology)†       0.604         0.884
Genscan                0.501         0.633

*Refers to those annotations produced by Otto using only the
Sim4-polished RefSeq alignment rather than an evidence-based Genscan
prediction. †Refers to those annotations produced by supplying all
available evidence to Genscan.
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those that passed were promoted to Otto 
predictions. Homology-based Otto predictions do not contain 3' and 5'
untranslated sequence. Although three de novo gene-finding programs
[GRAIL, Genscan, and FgenesH (63)] were run as part of the
computational analysis, the results of these programs were not directly
used in making the Otto predictions. Otto predicted 11,226 additional
genes by means of sequence similarity.
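
The retention criteria for homology-based predictions can be sketched as two small checks. The reading of "supported" used here (both exon edges lie within 10 bases of the edges of a single evidence hit) and the treatment of "hits" as the count of supported exons are interpretations, and the function names are hypothetical; the source does not give the exact procedure.

```python
def exon_supported(exon, evidence, tol=10):
    """True if some evidence interval matches both exon edges to within `tol` bases."""
    start, end = exon
    return any(abs(s - start) <= tol and abs(e - end) <= tol for s, e in evidence)


def retain_multi_exon(exons, evidence, matches_refseq=False, min_ratio=0.66):
    """Retain a multi-exon prediction if hits/exons > min_ratio or it matches RefSeq."""
    if matches_refseq:
        return True
    hits = sum(1 for exon in exons if exon_supported(exon, evidence))
    return hits / len(exons) > min_ratio


# Hypothetical three-exon prediction with evidence for two of the exons.
exons = [(100, 200), (500, 650), (900, 1000)]
evidence = [(95, 205), (498, 655)]
print(retain_multi_exon(exons, evidence))
```

In this toy case two of three exons are supported, a ratio of about 0.667, which just clears the 0.66 cutoff.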

3.2 Otto validation

To validate the Otto homology-based process and the method that Otto
uses to define the structures of known genes, we compared transcripts
predicted by Otto with their corresponding (and presumably correct)
transcript from a set of 4512 RefSeq transcripts for which there was a
unique SIM4 alignment (Table 7). In order to evaluate the relative
performance of Otto and Genscan, we made three comparisons. The first
involved a determination of the accuracy of gene models predicted by
Otto with only homology data other than the corresponding RefSeq
sequence (Otto homology in Table 7). We measured the sensitivity
(correctly predicted bases divided by the total length of the cDNA) and
specificity (correctly predicted bases divided by the sum of the
correctly and incorrectly predicted bases). Second, we examined the
sensitivity and specificity of the Otto predictions that were made
solely with the RefSeq sequence, which is the process that Otto uses to
annotate known genes (Otto-RefSeq). And third, we determined the
accuracy of the Genscan predictions corresponding to these RefSeq
sequences. As expected, the alignment method (Otto-RefSeq) was the most
accurate, and Otto-homology performed better than Genscan by both
criteria. Thus, 6.1% of true RefSeq nucleotides were not represented in
the Otto-RefSeq annotations and 2.7% of the nucleotides in the
Otto-RefSeq transcripts were not contained in the original RefSeq
transcripts. The discrepancies could come from legitimate differences
between the Celera assembly and the RefSeq transcript due to
polymorphisms, incomplete or incorrect data in the Celera assembly,
errors introduced by Sim4 during the alignment process, or the presence
of alternatively spliced forms in the data set used for the
comparisons.
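
The two measures, and the way the 6.1% and 2.7% figures follow from the Otto (RefSeq only) row of Table 7, can be written out explicitly. The function names and the toy base counts are illustrative; the 0.939 and 0.973 values are those reported in Table 7.

```python
def sensitivity(n_aligned, refseq_len):
    """Correctly predicted bases divided by the length of the RefSeq transcript."""
    return n_aligned / refseq_len


def specificity(n_aligned, prediction_len):
    """Correctly predicted bases divided by the length of the prediction."""
    return n_aligned / prediction_len


# Toy transcript (hypothetical): 900 bases agree between a 1000-base RefSeq
# transcript and a 950-base prediction.
toy_sens = sensitivity(900, 1000)
toy_spec = specificity(900, 950)

# Otto (RefSeq only) row of Table 7.
otto_refseq_sens, otto_refseq_spec = 0.939, 0.973
missed_refseq_fraction = 1 - otto_refseq_sens    # RefSeq bases absent from the annotation
extra_predicted_fraction = 1 - otto_refseq_spec  # predicted bases absent from RefSeq

print(f"{missed_refseq_fraction:.3f} {extra_predicted_fraction:.3f}")
```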

Because Otto uses an evidence-based approach to reconstruct genes, the
absence of experimental evidence for intervening exons may
inadvertently result in a set of exons that cannot be spliced together
to give rise to a transcript. In such cases, Otto may "split genes"
when in fact all the evidence should be combined into a single
transcript. We also examined the tendency of these methods to
incorrectly split gene predictions. These trends are shown in Fig. 8.
Both RefSeq and homology-based predictions by Otto split known genes
into fewer segments than Genscan alone.



3.3 Gene number 

Recognizing that the Otto system is quite conservative, we used a
different gene-prediction strategy in regions where the homology
evidence was less strong. Here the results of de novo gene predictions
were used. For these genes, we insisted that a predicted transcript
have at least two of the following types of evidence to be included in
the gene set for further analysis: protein, human EST, rodent EST, or
mouse genome fragment matches. This final class of predicted genes is a
subset of the predictions made by the three gene-finding programs that
were used in the computational pipeline. For these, there was not
sufficient sequence similarity information for Otto to attempt to
predict a gene structure. The three de novo gene-finding programs
resulted in about 155,695 predictions, of which ~76,410 were
nonredundant (nonoverlapping with one another). Of these, 57,935 did
not overlap known genes or predictions made by Otto. Only 21,350 of the
gene predictions that did not overlap Otto predictions were partially
supported by at least one type of sequence similarity evidence, and
8619 were partially supported by two types of evidence (Table 8).

The sum of this number (21,350) and the number of Otto annotations
(17,764), 39,114, is near the upper limit for the human gene
complement. As seen in Table 8, if the requirement for other supporting
evidence is made more stringent, this number drops rapidly so that
demanding two types of evidence reduces the total gene number to 26,383
and demanding three types reduces it to ~23,000. Requiring that a
prediction be supported by all four categories of evidence is too
stringent because it would eliminate genes that encode novel proteins
(members of currently undescribed protein families). No correction for
pseudogenes has been made at this point in the analysis.

In a further attempt to identify genes that were not found by the
autoannotation process or any of the de novo gene finders, we examined
regions outside of gene predictions that were similar to the EST
sequence, and where the EST matched the genomic sequence across a
splice junction. After correcting for potential 3' UTRs of predicted
genes, about 2500 such regions remained. Addition of a requirement for
at least one of the following evidence types (homology to mouse genomic
sequence fragments, rodent ESTs, or cDNAs, or similarity to a known
protein) reduced this number to 1010. Adding this to the numbers from
the previous paragraph would give us estimates of about 40,000, 27,000,
and 24,000 potential genes in the human genome, depending on the
stringency of evidence considered. Table 8 illustrates the number of
genes and presents the degree of






16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org






confidence based on the supporting evidence. Transcripts encoded by a
set of 26,383 genes were assembled for further analysis. This set
includes the 6538 genes predicted by Otto on the basis of matches to
known genes, 11,226 transcripts predicted by Otto based on homology
evidence, and 8619 from the subset of transcripts from de novo
gene-prediction programs that have two types of supporting evidence.
The 26,383 genes are illustrated along chromosome diagrams in Fig. 1.
These are a very preliminary set of annotations and are subject to all
the limitations of an automated process. Considerable refinement is
still necessary to improve the accuracy of these transcript
predictions. All the predictions and descriptions of genes and the
associated evidence that we present are the product of completely
computational processes, not expert curation. We have attempted to
enumerate the genes in the human genome in such a way that we have
different levels of confidence based on the amount of supporting
evidence: known genes, genes with good protein or EST homology
evidence, and de novo gene predictions confirmed by modest homology
evidence.
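
The gene-count estimates in this section are simple sums of the component sets, and the short script below retraces them. Variable names are illustrative; every number is taken from the text above.

```python
# Component gene sets quoted in the text.
otto_known = 6_538            # Otto, matches to known genes
otto_homology = 11_226        # Otto, homology evidence
de_novo_one_type = 21_350     # de novo, at least one type of evidence
de_novo_two_types = 8_619     # de novo, two types of evidence
est_only_regions = 1_010      # additional spliced-EST regions with other support

otto_total = otto_known + otto_homology
upper_estimate = otto_total + de_novo_one_type
analysis_set = otto_total + de_novo_two_types

# Adding the EST-only regions gives the rounded figures of about 40,000 and 27,000.
upper_with_est = upper_estimate + est_only_regions
analysis_with_est = analysis_set + est_only_regions

print(otto_total, upper_estimate, analysis_set, upper_with_est, analysis_with_est)
```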

3.4 Features of human gene 
transcripts 

We estimate the average span for a "typical" gene in the human DNA
sequence to be about 27,894 bases. This is based on the average span
covered by RefSeq transcripts, used because it represents our highest
confidence set.

The set of transcripts promoted to gene annotations varies in a number
of ways. As can be seen from Table 8 and Fig. 9, transcripts predicted
by Otto tend to be longer, having on average about 7.8 exons, whereas
those promoted from gene-prediction programs average about 3.7 exons.
The largest number of exons that we have identified in a transcript is
234 in the titin mRNA. Table 8 compares the amounts of evidence that
support the Otto and other predicted transcripts. For example, one can
see that a typical Otto transcript has 6.99 of its 7.81 exons supported
by protein homology evidence. As would be expected, the Otto
transcripts generally have more support than do transcripts predicted
by the de novo methods.

4 Genome Structure 
Summary. This section describes several of the noncoding attributes of
the assembled genome sequence and their correlations with the predicted
gene set. These include an analysis of G+C content and gene density in
the context of cytogenetic maps of the genome, an enumerative analysis
of CpG islands, and a brief description of the genome-wide repetitive
elements.



4.1 Cytogenetic maps 
Perhaps the most obvious, and certainly the most visible, element of
the structure of the genome is the banding pattern produced by Giemsa
stain. Chromosomal banding studies have revealed that about 17% to 20%
of the human chromosome complement consists of C-bands, or constitutive
heterochromatin (64). Much of this heterochromatin is highly
polymorphic and consists of different families of alpha satellite DNAs
with various higher order repeat structures (65). Many chromosomes have
complex inter- and intrachromosomal duplications present in
pericentromeric regions (66). About 5% of the sequence reads were
identified as alpha satellite sequences; these were not included in the
assembly.
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Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512 
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by
supplying all available evidence to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq
transcript. The zero class for the Otto-homology predictions shown here indicates that the
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call
was made because of insufficient evidence.



Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate
the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions).
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127.955 
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Number of 


58,032   14,463   5,094   8,043   9,220   21,350   8,619   4,947
319,935   48,594   19,344   26,264   40,104   79,148   31,130   17,508
exons
4.28
No. of exons per transcript
  Otto      7.84   5.77   6.01   6.99   7.24   7.81   7.19   6.00
  De novo   5.53   3.17   3.80   3.27   4.36   3.7   3.56   3.42   3.16


*Four kinds of evidence (conservation in 3X mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known proteins) were
considered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of predicted transcript. †This
number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.
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Examination of pericentromeric regions is' 
ongoing. 

The remaining ~80% of the genome, the euchromatic component, is divisible into G-, R-,
and T-bands (67). These cytogenetic bands have been presumed to differ in their nucleotide
composition and gene density, although we have been unable to determine precise band
boundaries at the molecular level. T-bands are the most G+C- and gene-rich, and G-bands are
G+C-poor (68). Bernardi has also offered a description of the euchromatin at the molecular
level as long stretches of DNA of differing base composition, termed isochores (denoted L, H1,
H2, and H3), which are >300 kbp in length (69). Bernardi defined the L (light) isochores as
G+C-poor (<43%), whereas the H (heavy) isochores fall into three G+C-rich classes
representing 24, 8, and 5% of the genome. Gene concentration has been claimed to be very low
in the L isochores and 20-fold more enriched in the H2 and H3 isochores (70). By examining
contiguous 50-kbp windows of G+C content across the assembly, we found that regions of
G+C content >48% (H3 isochores) averaged 273.9 kbp in length, those with G+C content
between 43 and 48% (H1+H2 isochores) averaged 202.8 kbp in length, and the average span
of regions with <43% (L isochores) was 1078.6 kbp. The correlation between G+C content
and gene density was also examined in 50-kbp windows along the assembled sequence
(Table 9 and Figs. 10 and 11). We found that the density of genes was greater in regions of
high G+C than in regions of low G+C content, as expected. However, the correlation between
G+C content and gene density was not as skewed as previously predicted (69). A higher
proportion of genes were located in the G+C-poor regions than had been expected.
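
The window-based classification described above can be sketched as follows. This is a minimal
illustration rather than the analysis code used for the assembly: the function names, the
decision to ignore undetermined bases, and the treatment of a window at exactly 43% are
assumptions. Only the 50-kbp window and the 43% and 48% cut-offs come from the text.

```python
from itertools import groupby

WINDOW = 50_000  # window size in bp, as stated in the text


def gc_fraction(seq):
    """Return the G+C fraction of a sequence, ignoring undetermined bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def isochore_class(gc):
    """Map a G+C fraction to the three classes used in the text."""
    if gc > 0.48:
        return "H3"
    if gc >= 0.43:  # assumption: a window at exactly 43% counts as H1+H2
        return "H1+H2"
    return "L"


def window_classes(seq, window=WINDOW):
    """Classify contiguous, non-overlapping windows along a sequence."""
    return [isochore_class(gc_fraction(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, window)]


def mean_run_length_kbp(classes, window=WINDOW):
    """Average length (kbp) of runs of consecutive windows of one class."""
    runs = {}
    for cls, group in groupby(classes):
        runs.setdefault(cls, []).append(sum(1 for _ in group) * window / 1000)
    return {cls: sum(lengths) / len(lengths) for cls, lengths in runs.items()}
```

Merging consecutive windows of the same class before averaging is one plausible reading of
"averaged ... kbp in length"; a different merging rule would give different averages.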

Chromosomes 17, 19, and 22, which have a disproportionate number of H3-containing
bands, had the highest gene density (Table 10). Conversely, the chromosomes that we
found to have the lowest gene density, X, 4, 18, 13, and Y, also have the fewest H3 bands.
Chromosome 15, which also has few H3 bands, did not have a particularly low gene
density in our analysis. In addition, chromosome 8, which we found to have a low gene
density, does not appear to be unusual in its H3 banding.

How valid is Ohno's postulate (71) that mammalian genomes consist of oases of genes
in otherwise essentially empty deserts? It appears that the human genome does indeed
contain deserts, or large, gene-poor regions. If we define a desert as a region >500 kbp
without a gene, then we see that 605 Mbp, or about 20% of the genome, is in deserts. These
are not uniformly distributed over the various chromosomes. Gene-rich chromosomes 17, 19,
and 22 have only about 12% of their collective 171 Mbp in deserts, whereas gene-poor
chromosomes 4, 13, 18, and X have 27.5% of their 492 Mbp in deserts (Table 11). The
apparent lack of predicted genes in these regions does not necessarily imply that they are
devoid of biological function.
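
The desert definition above (a region >500 kbp without a gene) lends itself to a short
sketch. The interval representation and the function names below are assumptions made for
illustration; the coordinates in any real run would come from the annotation.

```python
DESERT_MIN_BP = 500_000  # threshold from the text: >500 kbp without a gene


def find_deserts(genes, chrom_length, min_len=DESERT_MIN_BP):
    """Return (start, end) gaps longer than min_len that contain no gene.

    genes: iterable of (start, end) half-open intervals on one chromosome.
    """
    deserts = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_len:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_length - cursor > min_len:
        deserts.append((cursor, chrom_length))
    return deserts


def desert_fraction(genes, chrom_length):
    """Fraction of the chromosome that lies in deserts."""
    total = sum(end - start for start, end in find_deserts(genes, chrom_length))
    return total / chrom_length
```

Tracking the right-most gene end, rather than only the previous gene, keeps overlapping or
nested gene models from creating spurious gaps.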

4.2 Linkage map 

Linkage maps provide the basis for genetic analysis and are widely used in the study of the
inheritance of traits and in the positional cloning of genes. The distance metric, centimorgans
(cM), is based on the recombination rate between homologous chromosomes during meiosis.
In general, the rate of recombination in females is greater than that in males, and this
degree of map expansion is not uniform across the genome (72). One of the opportunities
enabled by a nearly complete genome sequence is to produce the ultimate physical map, and to
fully analyze its correspondence with two other maps that have been widely used in genome
and genetic analysis: the linkage map and the cytogenetic map. This would close the loop
between the mapping and sequencing phases of the genome project.

We mapped the location of the markers that constitute the Genethon linkage map to
the genome. The rate of recombination, expressed as cM per Mbp, was calculated for
3-Mbp windows as shown in Table 12. Higher rates of recombination in the telomeric
region of the chromosomes have been previously documented (73). From this mapping
result, there is a difference of 4.99 between lowest rates and highest rates and the largest
difference of 4.4 between males and females (4.99 to 0.47 on chromosome 16). This
indicates that the variability in recombination rates among regions of the genome exceeds
the differences in recombination rates between males and females. The human genome has
recombination hotspots, where recombination rates vary fivefold or more over a space of
1 kbp, so the picture one gets of the magnitude of variability in recombination rate will
depend on the size of the window



Table 9. Characteristics of G+C in isochores.

                          Fraction of genome          Fraction of genes
Isochore    G+C (%)       Predicted*    Observed      Predicted*    Observed
H3          >48           5             9.5           37            24.8
H1/H2       43-48         25            21.2          32            26.6
L           <43           67            69.2          31            48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the
human genome.



[Figure: bar chart of the number of transcripts against the number of exons per transcript
(up to >20); series: No. of Otto transcripts; No. of de novo + 1 line of evidence.]

Fig. 9. Comparison of the number of exons per transcript between the 17,968 Otto transcripts
and 21,350 de novo transcript predictions with at least one line of evidence that do not
overlap with an Otto prediction. Both sets have the highest number of transcripts in the
two-exon category, but the de novo gene predictions are skewed much more toward smaller
transcripts. In the Otto set, […]% of the transcripts have one or two exons, and 5.7% have
more than 20. In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2%
have more than 20.






16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org



THE HUMAN GENOME 



examined. Unfortunately, too few meiotic crossovers have occurred in Centre d'Etude
du Polymorphisme Humain (CEPH) and other reference families to provide a resolution any
finer than about 3 Mbp. The next challenge will be to determine a sequence basis of
recombination at the chromosomal level. An accurate predictor of the variation in
recombination rates between any pair of markers would be extremely useful in designing
markers to narrow a region of linkage, such as in positional cloning projects.

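The calculation of recombination rate in cM per Mbp over 3-Mbp windows can be sketched as
below. This is an illustrative reading, not the procedure used to build Table 12: the linear
interpolation between flanking markers, the clamping at the ends of the map, and all names
are assumptions.

```python
from bisect import bisect_right

WINDOW_BP = 3_000_000  # 3-Mbp windows, as in the text


def interpolate_cm(markers, pos):
    """Linearly interpolate the genetic position (cM) at a physical position.

    markers: list of (bp, cM) pairs sorted by bp, with at least two entries.
    Positions outside the marker range are clamped to the end markers.
    """
    xs = [bp for bp, _ in markers]
    if pos <= xs[0]:
        return markers[0][1]
    if pos >= xs[-1]:
        return markers[-1][1]
    i = bisect_right(xs, pos)  # first marker strictly to the right of pos
    (x0, y0), (x1, y1) = markers[i - 1], markers[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def window_rates(markers, window=WINDOW_BP):
    """Return [(window_start_bp, cM per Mbp)] over the span covered by markers."""
    markers = sorted(markers)
    first, last = markers[0][0], markers[-1][0]
    rates = []
    start = first
    while start + window <= last:
        cm = interpolate_cm(markers, start + window) - interpolate_cm(markers, start)
        rates.append((start, cm / (window / 1_000_000)))
        start += window
    return rates
```

With markers at 0, 3, and 6 Mbp carrying genetic positions of 0, 3, and 9 cM, the two windows
have rates of 1.0 and 2.0 cM/Mbp. Because the result depends on the window size, a smaller
window would expose more of the local variation discussed above.
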
4.3 Correlation between CpG islands 
and genes 

CpG islands are stretches of unmethylated DNA with a higher frequency of CpG
dinucleotides when compared with the entire genome (74). CpG islands are believed to
preferentially occur at the transcriptional start of genes, and it has been observed that most
housekeeping genes have CpG islands at the 5' end of the transcript (75, 76). In addition,
experimental evidence indicates that CpG island methylation is correlated with gene
inactivation (77) and has been shown to be important during gene imprinting (78) and
tissue-specific gene expression (79).

Experimental methods have been used that resulted in an estimate of 30,000 to
45,000 CpG islands in the human genome (74, 80) and an estimate of 499 CpG islands
on human chromosome 22 (81). Larsen et al. (76) and Gardiner-Garden and Frommer
(75) used a computational method to identify CpG islands and defined them as regions
of DNA of >200 bp that have a G+C content of >50% and a ratio of observed versus
expected frequency of CG dinucleotide ≥0.6.

It is difficult to make a direct comparison of experimental definitions of CpG islands
with computational definitions because computational methods do not consider the
methylation state of cytosine and experimental methods do not directly select regions of
high G+C content. However, we can determine the correlation of CpG islands with gene
starts, given a set of annotated genomic transcripts and the whole genome sequence. We have
analyzed the publicly available annotation of chromosome 22, as well as using the entire
human genome in our assembly and the computationally annotated genes. A variation of the
CpG island computation was compared with Larsen et al. (76). The main differences are
that we use a sliding window of 200 bp, consecutive windows are merged only if they
overlap, and we recompute the CpG value upon merging, thus rejecting any potential island
if it scores less than the threshold.
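
The sliding-window variant described above can be sketched as follows. The observed versus
expected ratio is computed with the commonly used form (number of CG dinucleotides × length)
/ (number of C × number of G); the step size, the function names, and the decision to apply
the G+C test again after merging are assumptions, not details given in the text.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc = (c + g) / n if n else 0.0
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc, obs_exp


def passes(seq, min_gc=0.5, min_ratio=0.6):
    """Apply the G+C content and CG ratio criteria to one region."""
    gc, ratio = cpg_stats(seq)
    return gc > min_gc and ratio >= min_ratio


def find_cpg_islands(seq, window=200, step=1, min_ratio=0.6):
    """Slide a window, merge overlapping hits, and re-score merged regions."""
    hits = [(i, i + window)
            for i in range(0, len(seq) - window + 1, step)
            if passes(seq[i:i + window], min_ratio=min_ratio)]
    merged = []
    for start, end in hits:
        if merged and start < merged[-1][1]:  # overlaps the previous region
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    # Recompute the statistics on each merged region; drop those that fail.
    return [(s, e) for s, e in merged if passes(seq[s:e], min_ratio=min_ratio)]
```

Calling `find_cpg_islands(seq, min_ratio=0.8)` corresponds to the stricter threshold. A step
of 1 bp is simple but slow on chromosome-scale input; a larger step is a reasonable trade-off
in practice.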

Fig. 10. Relation between G+C content and gene density. The blue bars show the percent of the
genome (in 50-kbp windows) with the indicated G+C content. The percent of the total number of
genes associated with each G+C bin is represented by the yellow bars. The graph shows that about
5% of the genome has a G+C content of between 50 and 55%, but that this portion contains
nearly 15% of the genes.

To compute various CpG statistics, we used two different thresholds of CG dinucleotide
likelihood ratio. Besides using the original threshold of 0.6 (method 1), we used a
higher threshold of CG dinucleotide likelihood ratio of 0.8 (method 2), which results in
the number of CpG islands on chromosome 22 close to the number of annotated genes on
this chromosome. The main results are summarized in Table 13. CpG islands computed
with method 1 predicted only 2.6% of the CSA sequence as CpG, but 40% of the gene
starts (start codons) are contained inside a CpG island. This is comparable to ratios
reported by others (82). The last two rows of the table show the observed and expected
average distance, respectively, of the closest CpG island from the first exon. The observed
average distances to the closest CpG island are smaller than the corresponding expected
distances, confirming an association between CpG islands and the first exon.

We also looked at the distribution of CpG island nucleotides among various sequence
classes such as intergenic regions, introns, exons, and first exons. We computed the
likelihood score for each sequence class as the ratio of the observed fraction of CpG
island nucleotides in that sequence class and the expected fraction of CpG island
nucleotides in that sequence class. The results of applying method 1 on CSA were
scores of 0.89 for intergenic region, 1.2 for intron, 5.86 for exon, and 13.2 for first
exon. The same trend was also found for chromosome 22 and after the application of
a higher threshold (method 2) on both data sets. In sum, genome-wide analysis has
extended earlier analysis and suggests a strong correlation between CpG islands and
first coding exons.
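
The likelihood score described above is a ratio of two fractions and can be illustrated with a
few lines of code. The text does not say how the expected fraction was obtained; the sketch
below assumes it is simply each class's share of all nucleotides, and it treats the classes as
non-overlapping. The class names and toy numbers are hypothetical.

```python
def likelihood_scores(class_bp, island_bp_in_class):
    """Observed/expected share of CpG island nucleotides per sequence class.

    class_bp: total nucleotides in each sequence class.
    island_bp_in_class: CpG island nucleotides falling in each class.
    Assumption: the expected share of a class is its share of all nucleotides.
    """
    total_bp = sum(class_bp.values())
    total_island = sum(island_bp_in_class.values())
    scores = {}
    for cls, bp in class_bp.items():
        observed = island_bp_in_class.get(cls, 0) / total_island
        expected = bp / total_bp
        scores[cls] = observed / expected
    return scores


# Toy numbers, chosen only to show the arithmetic.
toy = likelihood_scores(
    {"intergenic": 700, "intron": 250, "exon": 50},
    {"intergenic": 35, "intron": 25, "exon": 40},
)
```

A score above 1 means that a class holds more CpG island sequence than its size alone would
predict, which is the pattern reported for exons and first exons.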

4.4 Genome-wide repetitive elements 
The proportion of the genome covered by various classes of repetitive DNA is presented
in Table 14. We observed about 35% of the genome in these repeat classes, very similar
to values reported previously (83). Repetitive sequence may be underrepresented in
the Celera assembly as a result of incomplete repeat resolution, as discussed above. About
8% of the scaffold length is in gaps, and we expect that much of this is repetitive
sequence. Chromosome 19 has the highest repeat density (57%), as well as the highest
gene density (Table 10). Of interest, among the different classes of repeat elements, we
observe a clear association of Alu elements and gene density, which was not observed
between LINEs and gene density.

5 Genome Evolution 

Summary. The dynamic nature of genome evolution can be captured at several levels.
These include gene duplications mediated by RNA intermediates (retrotransposition) and
segmental genomic duplications. In this section, we document the genome-wide occurrence
of retrotransposition events generating functional (intronless paralogs) or inactive
genes (pseudogenes). Genes involved in translational processes and nuclear regulation
account for nearly 50% of all intronless paralogs and processed pseudogenes detected in
our survey. We have also cataloged the extent of segmental genomic duplication and
provide evidence for 1077 duplicated blocks covering 3522 distinct genes.
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Fig. 11. Genome structural features. 




Fig. 11 (continued). Relation among gene density (orange), G+C content (green), EST density
(blue), and Alu density (pink) along the lengths of each of the chromosomes. Gene density was
calculated in 1-Mbp windows. The percent of G+C nucleotides was calculated in 100-kbp
windows. The number of ESTs and Alu elements is shown per 100-kbp window.



5.1 Retrotransposition in the human genome

Retrotransposition of processed mRNA transcripts into the genome results in functional
genes, called intronless paralogs, or inactivated genes (pseudogenes). A paralog refers to
a gene that appears in more than one copy in a given organism as a result of a duplication
event. The existence of both intron-containing and intronless forms of genes encoding
functionally similar or identical proteins has been previously described (84, 85).
Cataloging these evolutionary events on the genomic landscape is of value in understanding
the functional consequences of such gene-duplication events in cellular biology.
Identification of conserved intronless paralogs in the mouse or other mammalian genomes
should provide the basis for capturing the evolutionary chronology of these transposition
events and provide insights into gene loss and accretion in the mammalian radiation.
A set of proteins corresponding to all 901 Otto-predicted, single-exon genes were
subjected to BLAST analysis against the proteins encoded by the remaining multiexon
predicted transcripts. Using homology criteria of 70% sequence identity over 90% of the
length, we identified 298 instances of single- to multi-exon correspondence. Of these 298
sequences, 97 were represented in the GenBank data set of experimentally validated
full-length genes at the stringency specified and were verified by manual inspection.

We believe that these 97 cases may represent intronless paralogs (see Web table 1 on
Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) of known
genes. Most of these are flanked by direct repeat sequences, although the precise nature
of these repeats remains to be determined. All of the cases for which we have high
confidence contain polyadenylated [poly(A)] tails characteristic of retrotransposition.
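
The homology criteria quoted above (70% sequence identity over 90% of the length) amount to a
two-part filter on each BLAST hit. The sketch below assumes that the hits have already been
parsed into simple records and that coverage is measured against the query length; the record
layout and names are hypothetical and do not correspond to any particular BLAST output format.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str        # protein from a single-exon gene
    subject: str      # protein from a multiexon transcript
    identity: float   # percent identity over the aligned region
    align_len: int    # aligned length on the query, in residues
    query_len: int    # full query length, in residues


def is_paralog_candidate(hit, min_identity=70.0, min_coverage=0.90):
    """Keep hits with enough identity over enough of the query length."""
    coverage = hit.align_len / hit.query_len
    return hit.identity >= min_identity and coverage >= min_coverage


def candidate_pairs(hits):
    """Return (query, subject) pairs that satisfy both criteria."""
    return [(h.query, h.subject) for h in hits if is_paralog_candidate(h)]
```

Whether coverage should instead be measured on the subject, or on the shorter of the two
sequences, is a design choice that the text leaves open.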

Recent publications describing the phenomenon of functional intronless paralogs
speculate that retrotransposition may serve as a mechanism used to escape X-chromosomal
inactivation (84, 86). We do not find a bias toward X chromosome origination of these
retrotransposed genes; rather, the results show a random chromosome distribution of
both the intron-containing and corresponding intronless paralogs. We also have found
several cases of retrotransposition from a single source chromosome to multiple target
chromosomes. Interesting examples include the retrotransposition of a five-exon-containing
ribosomal protein L21 gene on chromosome 13 onto chromosomes 1, 3, 4, 7, 10, and 14,
respectively. The size of the source genes can also show variability. The largest example is
the 31-exon diacylglycerol kinase zeta gene on chromosome 11 that has an intronless
paralog on chromosome 13. Regardless of route, retrotransposition with subsequent
gene changes in coding or noncoding regions that lead to different functions or expression
patterns represents a key route to providing an enhanced functional repertoire in
mammals (87).

Our preliminary set of retrotransposed intronless paralogs contains a clear
overrepresentation of genes involved in translational processes (40% ribosomal proteins
and 10% translation elongation factors) and nuclear regulation (HMG nonhistone proteins,
4%), as well as metabolic and regulatory enzymes. EST matches specific to a subset of
intronless paralogs suggest expression of these intronless paralogs. Differences in the
upstream regulatory sequences between the source genes and their intronless paralogs could
account for differences in tissue-specific gene expression. Defining which, if any, of these
processed genes are functionally expressed and translated will require further elucidation
and experimental validation.



5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is very similar to a normal gene but that has
been altered slightly so that it is not expressed. We developed a method for the
preliminary analysis of processed pseudogenes in the human genome as a starting point in
elucidating the ongoing evolutionary forces that account for gene inactivation. The
general structural characteristics of these processed pseudogenes include the complete
lack of intervening sequences found in the functional counterparts, a poly(A) tract at the
3' end, and direct repeats flanking the pseudogene sequence. Processed pseudogenes occur
as a result of retrotransposition, whereas unprocessed pseudogenes arise from segmental
genome duplication.

Table 11. Genome overview.

Size of the genome (including gaps)                                   2.91 Gbp
Size of the genome (excluding gaps)                                   2.66 Gbp
Longest contig                                                        1.99 Mbp
Longest scaffold                                                      14.4 Mbp
Percent of A+T in the genome                                          54
Percent of G+C in the genome                                          38
Percent of undetermined bases in the genome                           9
Most GC-rich 50 kb                                                    Chr. 2 (66%)
Least GC-rich 50 kb                                                   Chr. X (25%)
Percent of genome classified as repeats                               35
Number of annotated genes                                             26,383
Percent of annotated genes with unknown function                      42
Number of genes (hypothetical and annotated)                          39,114
Percent of hypothetical and annotated genes with unknown function     59
Gene with the most exons                                              Titin (234 exons)
Average gene size                                                     27 kbp
Most gene-rich chromosome                                             Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                           Chr. 13 (5 genes/Mb)
                                                                      Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)          605 Mbp
Percent of base pairs spanned by genes                                25.5 to 37.8*
Percent of base pairs spanned by exons                                1.1 to 1.4*
Percent of base pairs spanned by introns                              24.4 to 36.4*
Percent of base pairs in intergenic DNA                               74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons          Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons           Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)    Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                 1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the
hypothetical + annotated gene set (39,114 genes), respectively.

Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon
markers were placed on CSA-mapped assemblies, and then relative physical distances and rates
were calculated in 3-Mb windows for each chromosome. NA, not applicable.

          Male                    Sex-average             Female
Chrom.    Max.    Avg.    Min.    Max.    Avg.    Min.    Max.    Avg.    Min.
1         2.60    1.12    0.23    231     1.42    0.52    339     1.76    0.68
2         2.23    0.78    0.33    2.65    1.12    0.54    3.17    1.40    0.61
3         2.55    0.86    0.23    2.40    1.07    0.42    2.71    130     0.33
4         1.66    0.67    0.15    2.06    1.04    aeo     2.50    1.40    0.77
5         2.00    0.67    0.18    137     1.08    0.42    2.26    1.43    0.62
6         1.97    6.71    0.28    2.57    1.12    037     3.47    1.67    0.64
7                 1.16    0.48    1.67    1.17    0.47    2.27    1.21    034
8         1.83    0.73    0.14    2.40    1.05    0.46    3.44    136     0.43
9         2.01    039     0.53    135     132     0.77    2.63    1.66    0.82
10        3.73    1.03    0.22    3.05    1^9     0.66    234     1.51    0.76
11        1.43    0.72    031     2.13    039     0.47    3.10    132     0.49
12        4.12    0.76    0.26    335     1.16    0.49    233     135     039
13        1.60    0.75    0.01    137     035     0.17    2.49    1.19    032
14                0.98    0.18    2.65    130     0.62    3.14    1.63    0.75
15        2.28    034     0J4     231     1.22    0.42    2.53    1.55    0.54
16        1.83    1.00    0.47    2.70    135     0.63    4.99    232     1.12
17        3.87    0.87    0.00    334     135     0.54    4.19    1.83    034
18        3.12    137     036     3.75    1.66    0.43    435     2.24    0.72
19        3.02    037     0.10    237     1.41    0.49    239     1.75    0.87
20        3.64    039     0.00    2.79    1.50    0.83    331     2.15    134
21        3.23    1.26    0.69    237     1.62    1.08    2.58    1.90    1.18
22        1.25    1.10    0.84    1.88    1.41    1.08    3.73    2.08    0.93
X         NA      NA      NA      NA      NA      NA      3.12    1.64    0.72
Y         NA      NA      NA      NA      NA      NA      NA      NA      NA
Genome    4.12    038     0.00    3.75    1.22    0.17    439     135     032

We searched the complete set of Otto-predicted transcripts against the genomic
sequence by means of BLAST. Genomic regions corresponding to all Otto-predicted
transcripts were excluded from this analysis. We identified 2909 regions matching with
greater than 70% identity over at least 70% of the length of the transcripts that likely
represent processed pseudogenes. This number is probably an underestimate because specific
methods to search for pseudogenes were not used.

We looked for correlations between structural elements and the propensity for
retrotransposition in the human genome. GC content and transcript length were compared
between the genes with processed pseudogenes (1177 source genes) versus the remainder of
the predicted gene set. Transcripts that give rise to processed pseudogenes have shorter
average transcript length (1027 bp versus 1594 bp for the Otto set) as compared with genes
for which no pseudogene was detected. The overall GC content did not show any significant
difference, contrary to a recent report (88). There is a clear trend in gene families that
are present as processed pseudogenes. These include ribosomal proteins (67%), lamin
receptors (10%), translation elongation factor alpha (5%), and HMG-non-histone proteins
(2%). The increased occurrence of retrotransposition (both intronless paralogs and
processed pseudogenes) among genes involved in translation and nuclear regulation may
reflect an increased transcriptional activity of these genes.

5.3 Gene duplication in the human genome

Building on a previously published procedure (27), we developed a graph-theoretic
algorithm, called Lek, for grouping the predicted human protein set into protein
families (89).



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length)
and the whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1
uses a CG likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8.

                                                    Chromosome 22          Whole genome (CS assembly)
                                                    Method 1    Method 2   Method 1    Method 2
Number of CpG islands detected                      5,211       522        195,706     26,876
Average length of island (bp)                       390         535        395         497
Percent of sequence predicted as CpG                5S          0.8        2.6         0.4
Percent of first exons that overlap a CpG island    44          25         42          22
Percent of first exons with first position of       37          22         40          21
  exon contained inside a CpG island
Average distance between first exon and closest     1:o^3       10,486     2,182       17,021
  CpG island (bp)
Expected distance between first exon and closest    3,262       32,567     7,164       55,811
  CpG island (bp)



















Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

Repetitive elements                            Megabases in    Percent of    Previously
                                               assembled       assembly      predicted
                                               sequences                     (%) (83)
Alu                                            288             9.9           10.0
Mammalian interspersed repeat (MIR)            66              2.3           1.7
Medium reiteration (MER)                       50              1.7           1.6
Long terminal repeat (LTR)                     155             5.3           5.6
Long interspersed nucleotide element (LINE)    466             16.1          16.7
Total                                          1025            35.3          35.6



The complete clusters that result from the Lek clustering provide one basis for
comparing the role of whole-genome or chromosomal duplication in protein family expansion
as opposed to other means, such as tandem duplication. Because each complete cluster
represents a closed and certain island of homology, and because Lek is capable of
simultaneously clustering protein complements of several organisms, the number of proteins
contributed by each organism to a complete cluster can be predicted with confidence
depending on the quality of the annotation of each genome. The variance of each
organism's contribution to each cluster can then be calculated, allowing an assessment of
the relative importance of large-scale duplication versus smaller-scale, organism-specific
expansion and contraction of protein families, presumably as a result of natural selection
operating on individual protein families within an organism. As can be seen in Fig. 12, the
large variance in the relative numbers of human as compared with D. melanogaster and
Caenorhabditis elegans proteins in complete clusters may be explained by multiple events
of relative expansions in gene families in each of the three animal genomes. Such
expansions would give rise to the distribution that shows a peak at 1:1 in the ratio for
human-worm or human-fly clusters with the slope spread covering both human and fly/worm
predominance, as we observed (Fig. 12). Furthermore, there are nearly as many clusters
where worm and fly proteins predominate despite the larger numbers of proteins in the
human. At face value, this analysis suggests that natural selection acting on individual
protein families has been a major force driving the expansion of at least some elements of
the human protein set. However, in our analysis, the difference between an ancient
whole-genome duplication followed by loss, versus piecemeal duplication, cannot be easily
distinguished. In order to differentiate these scenarios, more extended analyses were
performed.

5.4 Large-scale duplications
Using two independent methods, we searched for large-scale duplications in the
human genome. First, we describe a protein family-based method that identified highly
conserved blocks of duplication. We then describe our comprehensive method for
identifying all interchromosomal block duplications. The latter method identified a large
number of duplicated chromosomal segments covering parts of all 24 chromosomes.

The first of the methods is based on the idea of searching for blocks of highly
conserved homologous proteins that occur in more than one location on the genome. For
this comparison, two genes were considered equivalent if their protein products were
determined to be in the same family and the same complete Lek cluster (essentially
paralogous genes) (89). Initially, each chromosome was represented as a string of genes
ordered by the start codons for predicted genes along the chromosome. We considered
the two strands as a single string, because local inversions are relatively common events
relative to large-scale duplications. Each gene was indexed according to the protein
family and Lek complete cluster (89). All pairs of indexed gene strings were then
aligned in both the forward and reverse directions with the Smith-Waterman algorithm
(90). A match between two proteins of the same Lek complete cluster was given a score
of 10 and a mismatch -10, with gap open and extend penalties of -4 and -1. With
these parameters, 19 conserved interchromosomal blocks of duplication were observed,
all of which were also detected and expanded by the comprehensive method described
below. The detection of only a relatively small number of block duplications was a
consequence of using an intrinsically conservative method grounded in the conservative
constraints of the complete Lek clusters.
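
The alignment of indexed gene strings can be illustrated with a small affine-gap local
alignment over lists of cluster identifiers. This is a sketch under stated assumptions: a gap
of length k is charged as gap open plus (k - 1) times gap extend, genes without a cluster are
represented by None and never match, and only the score is returned. The text supplies the
scoring values but not these conventions.

```python
NEG_INF = float("-inf")


def smith_waterman(a, b, match=10, mismatch=-10, gap_open=-4, gap_extend=-1):
    """Best local alignment score between two lists of cluster identifiers.

    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n, m = len(a), len(b)
    # H: best score of an alignment ending at (i, j).
    # E, F: best score ending at (i, j) with a gap in a or in b, respectively.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            same = a[i - 1] is not None and a[i - 1] == b[j - 1]
            diag = H[i - 1][j - 1] + (match if same else mismatch)
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def best_block_score(chrom_a, chrom_b, **scoring):
    """Align in forward and reverse orientation and keep the higher score."""
    return max(smith_waterman(chrom_a, chrom_b, **scoring),
               smith_waterman(chrom_a, chrom_b[::-1], **scoring))
```

Because a mismatch costs as much as a match gains, an alignment can bridge an unrelated gene
more cheaply through a gap than through a mismatch, which suits blocks interrupted by
insertions of unrelated genes.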

In the second, more comprehensive ap- 
proach, we aligned all chrocdosomes directly 
with one another using an algorithm based on 
the MUMmer system (9J), This alignment 
method uses a suffix tree data structure and a^ 
linear-time algorithm to align long sequences , 
very rapidly; for example, tvvo chromosomes, 
of 100 Mbp can be aligned in less than 20 ' 
min (on a Compaq Alpha computer) with 4 
gigabytes of memory. This procedure was 
used recently to identify numerous large- 
scale segmental duplications among the five 
chromosomes of A. thaliana (92); in that, 
organism, the method revealed that 60% of 
the genome (66 Mbp) is covered by 24 vecy 
large duplicated segments. For Arabidopsis, a 
DNA-based alignment was sufficient to re- 
veal the segmental duplications between 
chromosomes; in the human genome, DNA alignments at the
whole-chromosome level are insufficiently sensitive. Therefore, a
modified procedure was developed and applied, as follows. First, all
26,588 proteins (9,675,713 amino acids) were concatenated end-to-end
in order as they occur along each of the 24 chromosomes, irrespective
of strand location. The concatenated protein set was then aligned
against each chromosome by the MUMmer algorithm. The resulting
matches were clustered to extract all sets of three or more protein
matches that occur in close proximity on two different chromosomes
(93); these represent the candidate segmental duplications. A series
of filters was developed and applied to remove likely false-positives
from this set; for example, small blocks that were spread across many
proteins were removed. To refine the filtering methods, a shuffled
protein set was first created by taking the 26,588 proteins,
randomizing their order, and then partitioning them into 24 shuffled
chromosomes, each containing the same number of proteins as the true
genome. This shuffled protein set has the identical composition to
the real genome; in particular, every protein and every domain
appears the same number of times. The complete algorithm was then
applied to both the real and the shuffled data, with the results on
the shuffled data being used to estimate the false-positive rate. The
algorithm after filtering yielded 10,310 gene pairs in 1077
duplicated blocks containing 3522 distinct genes; tandemly duplicated
expansions in many of the blocks explain the excess of gene pairs
over distinct genes. In the shuffled data, by contrast, only 370 gene
pairs were found, giving a false-positive estimate of 3.6%. The most
likely explanation for the 1077 block duplications is ancient
segmental duplications. In many cases, the order of the proteins has
been shuffled, although proximity is preserved. Out of the 1077
blocks, 159 contain only three genes, 137 contain four genes, and 781
contain five or more genes.
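
The shuffled-genome control described above can be sketched in a few
lines of Python. This is a minimal illustration under stated
assumptions, not the authors' pipeline: the function names
`shuffle_genome` and `false_positive_rate` and the toy protein labels
are hypothetical, and the MUMmer alignment and clustering steps are
omitted. Only the counts 10,310, 370, 159, 137, 781, and 1077 come
from the text.

```python
import random

def shuffle_genome(proteins_by_chrom, seed=0):
    """Randomize protein order, then cut the list back into chromosomes
    holding the same number of proteins as the originals."""
    rng = random.Random(seed)
    pool = [p for chrom in proteins_by_chrom for p in chrom]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for chrom in proteins_by_chrom:
        shuffled.append(pool[start:start + len(chrom)])
        start += len(chrom)
    return shuffled

def false_positive_rate(pairs_in_shuffled, pairs_in_real):
    """Fraction of reported gene pairs expected to arise by chance."""
    return pairs_in_shuffled / pairs_in_real

# Toy genome (hypothetical labels): composition and sizes are preserved.
toy = [["p1", "p2", "p3"], ["p4", "p5"]]
control = shuffle_genome(toy)

# Counts reported in the text.
rate = false_positive_rate(370, 10_310)
print(f"false-positive estimate: {rate:.1%}")  # 3.6%
block_sizes = {"three genes": 159, "four genes": 137, "five or more": 781}
print(sum(block_sizes.values()))               # 1077
```

Because the shuffled set keeps every protein and every chromosome
size, any gene pairs the algorithm reports on it can only reflect
chance clustering, which is what makes the ratio a usable
false-positive estimate.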

To illustrate the extent of the detected duplications, Fig. 13 shows
all 1077 block duplications indexed to each chromosome in 24 panels
in which only duplications mapped to the indexed chromosome are
displayed. The figure makes it clear that the duplications are
ubiquitous in the genome. One feature that it displays is many
relatively small chromosomal stretches, with one-to-many duplication
relationships that are graphically striking. One such example
captured by the analysis is the well-documented olfactory receptor
(OR) family, which is scattered in blocks throughout the genome and
which has been analyzed for genome-deployment reconstructions at
several evolutionary stages (94). The figure also illustrates that
some chromosomes, such as chromosome 2, contain many more detected
large-scale duplications than others. Indeed, one of the largest
duplicated segments is a large block of 33 proteins on chromosome 2,
spread among eight smaller blocks in 2p, that aligns to a paralogous
set on chromosome 14, with one rearrangement (see chromosomes 2 and
14 panels in Fig. 13).

The proteins are not contiguous but span a region containing 97
proteins on chromosome 2 and 332 proteins on chromosome 14. The
likelihood of observing this many duplicated proteins by chance, even
over a span of this length, is 2.3 × 10^-[illegible] (93). This
duplicated set spans 20 Mbp on chromosome 2 and 63 Mbp on chromosome
14, over 70% of the latter chromosome. Chromosome 2 also contains a
block duplication that is nearly as large, which is shared by
chromosome arm 2q and chromosome 12. This duplication incorporates
two of the four known Hox gene clusters, but considerably expands the
extent of the duplications proximally and distally on the pair of
chromosome arms. This breadth of duplication is also seen on the two
chromosomes carrying the other two Hox clusters.

Fig. 12. Gene duplication in complete protein clusters. The predicted
protein sets of human, worm, and fly were subjected to Lek clustering
(27). The numbers of clusters with varying ratios (whole number) of
human versus worm and human versus fly proteins per cluster were
plotted.

An additional large duplication, between chromosomes 18 and 20,
serves as a good example to illustrate some of the features common to
many of the other observed large duplications (Fig. 13, inset). This
duplication contains 64 detected ordered intrachromosomal pairs of
homologous genes. After discounting a 40-Mb stretch of chromosome 18
free of matches to chromosome 20, which is likely to represent a
large insert (between the gene assignments "Krup rel" and "collagen
rel" on chromosome 18 in Fig. 13), the full duplication segment
covers 36 Mb on chromosome 18 and 28 Mb on chromosome 20. By this
measure, the duplication segment spans nearly half of each
chromosome's net length. The most likely scenario is that the whole
span of this region was duplicated as a single very large block,
followed by shuffling owing to smaller scale rearrangements. As such,
at least four subsequent rearrangements would need to be invoked to
explain the relative insertions and inversions seen in the duplicated
segment interval. The 64 protein pairs in this alignment occur among
217 protein assignments on chromosome 18, and among 322 protein
assignments on chromosome 20, for a density of involved proteins of
20 to 30%. This is consistent with an ancient large-scale duplication
followed by subsequent gene loss on one or both chromosomes. Loss of
just one member of a gene pair subsequent to the duplication would
result in a failure to score a gene pair in the block; less than 50%
gene loss on the chromosomes would lead to the duplication density
observed here. As an independent verification of the significance of
the alignments detected, it can be seen that a substantial number of
the pairs of aligning proteins in this duplication, including some of
those annotated (Fig. 13), are those populating small Lek complete
clusters (see above). This indicates that they are members of very
small families of paralogs; their relative scarcity within the genome
validates the uniqueness and robust nature of their alignments.
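
The quoted density of 20 to 30% for the chromosome 18 to 20
duplication follows directly from the counts in the paragraph above.
The short sketch below only reproduces that arithmetic; the variable
names are illustrative, not taken from the source.

```python
# Counts from the text: 64 protein pairs among 217 protein assignments
# on chromosome 18 and among 322 on chromosome 20.
pairs = 64
assignments = {"chromosome 18": 217, "chromosome 20": 322}

# Density = share of the proteins in the span that belong to a scored pair.
density = {chrom: pairs / n for chrom, n in assignments.items()}

for chrom, d in density.items():
    print(f"{chrom}: {d:.1%}")
# chromosome 18: 29.5%
# chromosome 20: 19.9%
```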

Two additional qualitative features were observed among many of the
large-scale duplications. First, several proteins with disease
associations, with OMIM (Online Mendelian Inheritance in Man)
assignments, are members of duplicated segments (see web table 2 on
Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). We have also
observed a few instances where paralogs on both duplicated segments
are associated with similar disease conditions. Notable among these
genes are proteins involved in hemostasis (coagulation factors) that
are associated with bleeding disorders, transcriptional regulators
like the homeobox proteins associated with developmental disorders,
and potassium channels associated with cardiovascular conduction
abnormalities. For each of these disease genes, closer study of the
paralogous genes in the duplicated segment may reveal new insights
into disease causation, with further investigation needed to
determine whether they might be involved in the same or similar
genetic diseases. Second, although there is a conserved number of
proteins and coding exons predicted for specific large duplicated
spans within the chromosome 18 to 20 alignment, the genomic DNA of
chromosome 18 in these specific spans is in some cases more than
10-fold longer than the corresponding chromosome 20 DNA. This
selective accretion of noncoding DNA (or conversely, loss of
noncoding DNA) on one of a pair of duplicated chromosome regions was
observed in many compared regions. Hypotheses to explain which
mechanisms foster these processes must be tested.

Evaluation of the alignment results gives some perspective on dating
of the duplications. As noted above, large-scale ancient segmental
duplication in fact best explains many of the blocks detected by this
genome-wide analysis. The regions of human chromosomes involved in
the large-scale duplications expanded upon above (chromosomes 2 to
14, 2 to 12, and 18 to 20) are each syntenic to a distinct mouse
chromosomal region. The corresponding mouse chromosomal regions are
much more similar in sequence conservation, and even in order, to
their human synteny partners than the human duplication regions are
to each other. Further, the corresponding mouse chromosomal regions
each bear a significant proportion of genes orthologous to the human
genes on which the human duplication assignments were made. On the
basis of these factors, the corresponding mouse chromosomal spans, at
coarse resolution, appear to be products of the same large-scale
duplications observed in humans. Although further detailed analysis
must be carried out once a more complete genome is assembled for
mouse, the underlying large duplications appear to predate the two
species' divergence. This dates the duplications, at the latest,
before divergence of the primate and rodent lineages. This date can
be further refined upon examination of the synteny between human
chromosomes and those of chicken, pufferfish (Fugu rubripes), or
zebrafish (95). The only substantial syntenic stretches mapped in
these species corresponding to both pairs of human duplications are
restricted to the Hox cluster regions. When the synteny of these
regions (or others) to human chromosomes is extended with further
mapping, the ages of the nearly chromosome-length duplications seen
in humans are likely to be dated to the root of vertebrate
divergence.

The MUMmer-based results demonstrate large block duplications that
range in size from a few genes to segments covering most of a
chromosome. The extent of segmental duplications raises the question
of whether an ancient whole-genome duplication event is the
underlying explanation for the numerous duplicated regions (96). The
duplications have undergone many deletions and subsequent
rearrangements; these events make it difficult to distinguish between
a whole-genome duplication and multiple smaller events. Further
analysis, focused especially on comparing the estimated ages of all
the block duplications, derived partially from interspecies genome
comparisons, will be necessary to determine which of these two
hypotheses is more likely. Comparisons of genomes of different
vertebrates, and even cross-phyla genome comparisons, will allow for
the deconvolution of duplications to eventually reveal the stagewise
history of our genome, and with it a history of the emergence of many
of the key functions that distinguish us from other living things.

6 A Genome-Wide Examination of
Sequence Variations

Summary. Computational methods were used to identify
single-nucleotide polymorphisms (SNPs) by comparison of the Celera
sequence to other SNP resources. The SNP rate between two chromosomes
was ~1 per 1200 to 1500 bp. SNPs are distributed nonrandomly
throughout the genome. Only a very small proportion of all SNPs (<1%)
potentially impact protein function based on the functional analysis
of SNPs that affect the predicted coding regions. This results in an
estimate that only thousands, not millions, of genetic variations may
contribute to the structural diversity of human proteins.

Having a complete genome sequence enables researchers to achieve a
dramatic acceleration in the rate of gene discovery, but only through
analysis of sequence variation in DNA can we discover the genetic
basis for variation in health among human beings. Whole-genome
shotgun sequencing is a particularly effective method for detecting
sequence variation in tandem with whole-genome assembly. In addition,
we compared the distribution and attributes of SNPs ascertained by
three other methods: (i) alignment of the Celera consensus sequence
to the PFP assembly, (ii) overlap of high-quality reads of genomic
sequence (referred to as "Kwok"; 1,120,195 SNPs) (97), and (iii)
reduced representation shotgun sequencing (referred to as "TSC";
632,640 SNPs) (98). These data were consistent in showing an overall
nucleotide diversity of ~8 × 10^-4, marked heterogeneity across the
genome in SNP density, and an overwhelming preponderance of noncoding
variation that produces no change in expressed proteins.

6.1 SNPs found by aligning the Celera
consensus to the PFP assembly

Ideally, methods of SNP discovery make full use of sequence depth and
quality at every site, and quantitatively control the rate of
false-positive and false-negative calls with an explicit sampling
model (99). Comparison of consensus sequences in the absence of these
details necessitated a more ad hoc approach (quality scores could not
readily be obtained for the PFP assembly). First, all sequence
differences between the two consensus sequences were identified;
these were then filtered to reduce the contribution of sequencing
errors and [illegible]. As a measure of the effectiveness of the
filtering step, we monitored the ratio of transition to transversion
substitutions, because a 2:1 ratio has been well documented as
typical in mammalian evolution (100) and in human SNPs (101, 102).
The filtering steps consisted of removing variants where the quality
score in the Celera consensus was less than 30 and where the density
of variants was greater than 5 in 400 bp. These filters resulted in
shifting the transition-to-transversion ratio from [illegible] to
1.89:1. When applied to 2.3 Gbp of alignments between the Celera and
PFP consensus sequences, these filters resulted in identification of
[illegible]. Overlaps between this set of SNPs and those found by
other methods are described below.

6.2 Comparisons to public SNP
databases

Additional SNPs, including 2,536,021 from dbSNP
(www.ncbi.nlm.nih.gov/SNP) and 13,150 from HGMD (Human Gene Mutation
Database, from the University of Wales, UK), were mapped on the
Celera consensus sequence by a sequence similarity search with the
program PowerBlast (103). The two largest data sets in dbSNP are the
Kwok and TSC sets, with 47% and 25% of the dbSNP records. Low-quality
alignments with partial coverage of the dbSNP sequence and alignments
that had less than 98% sequence identity between the Celera sequence
and the dbSNP flanking sequence were eliminated. dbSNP sequences
mapping to multiple locations on the Celera genome were discarded. A
total of 2,336,935 dbSNP variants were mapped to 1,[illegible]23,038
unique locations on the Celera sequence, implying considerable
redundancy in dbSNP. SNPs in the TSC set mapped to 585,811 unique
genomic locations, and SNPs in the Kwok set mapped to 438,032 unique
locations. The combined unique SNP count used in this analysis,
including Celera-PFP, TSC, and Kwok, is 2,737,668. Table 15 shows
that a substantial fraction of SNPs identified by one of these
methods was also found by another method. The very high overlap
(36.2%) between the Kwok and Celera-PFP SNPs may be due in part to
the use by Kwok of sequences that went into the PFP assembly. The
unusually low overlap (6.4%) between the Kwok and TSC sets is due to
their being the smallest two sets. In addition, 24.5% of the
Celera-PFP SNPs overlap with SNPs derived from the Celera genome
sequences (46). SNP validation in population samples is an expensive
and laborious process, so confirmation on multiple data sets may
provide an efficient initial validation "in silico" (by computational
analysis).

Table 15. [illegible] SNP databases. Table entries are SNP counts for
each pair of data sets. Numbers in parentheses are the fraction of
overlap, calculated as the count of overlapping SNPs divided by the
number of SNPs [illegible] databases compared. [illegible] databases
are: Celera-PFP, 2,104,820; TSC, 585,811; and Kwok, 438,032.

One means of assessing whether the [illegible] of human variation is
to tally the frequencies of the six possible base changes in each set
of SNPs (Table 16). Previous measures of nucleotide diversity were
mostly derived from small-scale analysis on candidate genes (101),
and our analysis with all three data sets validates the previous
observations at the whole-genome scale. There is remarkable
homogeneity between the SNPs found in the Kwok set, the TSC set, and
in our whole-genome shotgun (46) in this substitution pattern.
Compared with the rest of the data sets, Celera-PFP deviates slightly
from the 2:1 transition-to-transversion ratio observed in the other
SNP sets. This result is not unexpected, because some fraction of the
computationally identified SNPs in the Celera-PFP comparison may in
fact be sequence errors. A 2:1 transition:transversion ratio for the
bona fide SNPs would be obtained if one assumed that 15% of the
sequence differences in the Celera-PFP set were a result of
(presumably random) sequence errors.
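
The quality and density filters of section 6.1, together with the
transition-to-transversion check used to monitor them, can be
sketched as follows. This is a hedged illustration, not the authors'
code: the `Variant` record, the function names, the order in which the
two filters are applied, and the reading of "more than 5 variants in
400 bp" as a sliding window are all assumptions, and the toy calls are
invented for the example. Only the thresholds (quality 30, 5 variants,
400 bp) come from the text.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    pos: int      # coordinate on the consensus sequence
    ref: str      # base in one consensus
    alt: str      # base in the other consensus
    quality: int  # quality score at this site

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)

def filter_variants(variants, min_quality=30, window=400, max_in_window=5):
    """Drop low-quality calls, then drop every call that falls in a
    window of `window` bp holding more than `max_in_window` calls."""
    kept = sorted((v for v in variants if v.quality >= min_quality),
                  key=lambda v: v.pos)
    dense = set()
    j = 0
    for i in range(len(kept)):
        j = max(j, i)
        while j + 1 < len(kept) and kept[j + 1].pos - kept[i].pos < window:
            j += 1
        if j - i + 1 > max_in_window:
            dense.update(range(i, j + 1))
    return [v for k, v in enumerate(kept) if k not in dense]

def ts_tv_ratio(variants):
    """Transitions divided by transversions (needs one transversion)."""
    ts = sum(is_transition(v.ref, v.alt) for v in variants)
    return ts / (len(variants) - ts)

# Toy calls: a dense cluster of six transversions, three isolated
# high-quality calls, and one low-quality transition.
calls = [Variant(100 + 10 * k, "A", "C", 40) for k in range(6)]
calls += [Variant(1000, "A", "G", 40), Variant(2000, "C", "T", 40),
          Variant(3000, "A", "C", 40), Variant(4000, "G", "A", 10)]
clean = filter_variants(calls)
print(len(clean), ts_tv_ratio(clean))  # 3 2.0
```

In the toy data the dense cluster is dominated by transversions, so
removing it moves the ratio toward 2:1, which mirrors the direction
of the shift reported in the text.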



6.3 Estimation of nucleotide diversity
from ascertained SNPs

The number of SNPs identified varied widely across chromosomes. In
order to normalize these values to the chromosome size and sequence
coverage, we used π, the standard statistic for nucleotide diversity
(104). Nucleotide diversity is a measure of per-site heterozygosity,
quantifying the probability that a pair of chromosomes drawn from the
population will differ at a nucleotide site. In order to calculate
nucleotide diversity for each chromosome, we need to know the number
of nucleotide sites that were surveyed for variation, and in methods
like reduced representation sequencing, we need to know the sequence
quality and the depth of coverage at each site. These data are not
readily available, so we could not estimate nucleotide diversity from
the TSC effort. Estimation of nucleotide diversity from high-quality
sequence overlaps should be possible, but again more information is
needed on the details of all the alignments.

Estimation of nucleotide diversity from a shotgun assembly entails
calculating, for each column of the multialignment, the probability
that two or more distinct alleles are present and the probability of
detecting a SNP if in fact the alleles have different sequence (i.e.,
the probability of correct sequence calls). The greater the depth of
coverage and the higher the sequence quality, the higher is the
chance of successfully detecting a SNP (105). Even after correcting
for variation in coverage, the nucleotide diversity appeared to vary
across autosomes. The significance of this heterogeneity was tested
by analysis of variance with estimates of π for 100-kbp windows to
estimate variability within chromosomes (for the Celera-PFP
comparison, F = 29.73, P < 0.0001).

Average diversity for the autosomes estimated from the Celera-PFP
comparison was 8.94 × 10^-4. Nucleotide diversity on the X chromosome
was 6.54 × 10^-4. The X is expected to be less variable than
autosomes, because for every four copies of autosomes in the
population, there are only three X chromosomes, and this smaller
effective population size means that random drift will more rapidly
remove variation from the X (106).
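
A quick check of the X-to-autosome comparison is sketched below. It
assumes that the garbled exponents in the scan read 10^-4, which is
consistent with the stated rate of about one SNP per 1200 to 1500 bp,
and it uses the simplifying assumption that diversity scales with the
number of chromosome copies (three X for every four autosomes) and
nothing else. The variable names are illustrative.

```python
# Diversity values from the text (exponent 10^-4 is an assumption).
pi_autosomes = 8.94e-4
pi_x = 6.54e-4

# Three X chromosomes for every four autosomes gives an expected
# ratio of 3/4 if diversity simply tracks copy number.
observed = pi_x / pi_autosomes
expected = 3 / 4

print(f"observed X/autosome ratio: {observed:.2f}, expected: {expected:.2f}")
# observed X/autosome ratio: 0.73, expected: 0.75
print(f"about one difference per {1 / pi_autosomes:.0f} bp")
# about one difference per 1119 bp
```

The observed ratio sits close to the copy-number expectation, in line
with the argument in the paragraph above.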

Having ascertained nucleotide variation genome-wide, it appears that
previous estimates of nucleotide diversity in humans based on samples
of genes were reasonably accurate (101, 102, 106, 107). Genome-wide,
our estimate of nucleotide diversity was 8.98 × 10^-4 for the
Celera-PFP alignment, and a published estimate averaged over 10
densely resequenced human genes was 8.00 × 10^-4 (108).

Table 16. Summary of nucleotide changes in different SNP data sets.

Fig. 13. Segmental duplications between chromosomes in the human
genome. The 24 panels show the 1077 duplicated blocks of genes,
containing 10,310 pairs of genes in total. Each line represents a
pair of homologous genes belonging to a block; all blocks contain at
least three genes on each of the chromosomes where they appear. Each
panel shows all the duplications between a single chromosome and
other chromosomes with shared blocks. The chromosome at the center of
each panel is shown as a thick red line for emphasis. Other
chromosomes are displayed from top to bottom within each panel
ordered by chromosome number. The inset (bottom, center right) shows
a close-up of one duplication between chromosomes 18 and 20, expanded
to display the gene names of 12 of the 64 gene pairs shown.

6.4 Variation in nucleotide diversity
across the human genome

Such an apparently high degree of variability among chromosomes in
SNP density raises the question of whether there is heterogeneity at
a finer scale within chromosomes, and whether this heterogeneity is
greater than expected by chance. If SNPs occur by random and
independent mutations, then it would seem that there ought to be a
Poisson distribution of numbers of SNPs in fragments of arbitrary
constant size. The observed dispersion in the distribution of SNPs in
100-kbp fragments was far greater than predicted from a Poisson
distribution (Fig. 14). However, this simplistic model ignores the
different recombination rates and population histories that exist in
different regions of the genome. Population genetics theory holds
that we can account for this variation with a mathematical
formulation called the neutral coalescent (109). Applying well-tested
algorithms for simulating the neutral coalescent with recombination
(110), and using an effective population size of 10,000 and a
per-base recombination rate equal to the mutation rate (111), we
generated a distribution of numbers of SNPs by this model as well
(112). The observed distribution of SNPs has a much larger variance
than either the Poisson model or the coalescent model, and the
difference is highly significant. This implies that there is
significant variability across the genome in SNP density, an
observation that begs an explanation.
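
One simple way to see the departure from a Poisson model is the
index of dispersion, the variance of the per-window SNP counts
divided by their mean, which is about 1 for Poisson-distributed
counts. The sketch below is an illustration only: the window counts
are invented, the function name is hypothetical, and this statistic
is a stand-in for, not a reproduction of, the comparison shown in
Fig. 14.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by mean; roughly 1 under a Poisson model."""
    return variance(counts) / mean(counts)

# Hypothetical SNP counts in eight 100-kbp windows.
window_counts = [70, 120, 45, 200, 90, 30, 150, 95]
index = dispersion_index(window_counts)
print(f"{index:.1f}")  # 31.2
```

A value far above 1, as in this toy example, indicates
overdispersion: the counts vary between windows much more than
independent random mutation at a uniform rate would produce.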

Several attributes of the DNA sequence may affect the local density
of SNPs, including the rate at which DNA polymerase makes errors and
the efficacy of mismatch repair. One key factor that is likely to be
associated with SNP density is the G+C content, in part because
methylated cytosines in CpG dinucleotides tend to undergo deamination
to form thymine, accounting for a nearly 10-fold increase in the
mutation rate of CpGs over other dinucleotides. We tallied the GC
content and nucleotide diversities in 100-kbp windows across the
entire genome and found that the correlation between them was
positive (r = 0.21) and highly significant (P < 0.0001), but G+C
content accounted for only a small part of the variation.

6.5 SNPs by genomic class

To test homogeneity of SNP densities across functional classes, we
partitioned sites into intergenic (defined as >5 kbp from any
predicted transcription unit), 5'-UTR, exonic (missense and silent),
intronic, and 3'-UTR for 10,239 known genes, derived from the NCBI
RefSeq database and all human genes predicted from the Celera Otto
annotation. In coding regions, SNPs were categorized as either
silent, for those that do not change amino acid sequence, or
missense, for those that change the protein product. The ratio of
missense to silent coding SNPs in Celera-PFP, TSC, and Kwok sets
(1.12, 0.91, and 0.78, respectively) shows a markedly reduced
frequency of missense variants compared with the neutral expectation,
consistent with the elimination by natural selection of a fraction of
the deleterious amino acid changes (112). These ratios are comparable
to the missense-to-silent ratios of 0.88 and 1.17 found by Cargill et
al. (101) and by Halushka et al. (102). Similar results were observed
in SNPs derived from Celera shotgun sequences (46).

It is striking how small is the fraction of SNPs that lead to
potentially dysfunctional alterations in proteins. In the 10,239
RefSeq genes, missense SNPs were only about 0.12, 0.14, and 0.17% of
the total SNP counts in Celera-PFP, TSC, and Kwok SNPs, respectively.
Nonconservative protein changes constitute an even smaller fraction
of missense SNPs (47, 41, and 40% in Celera-PFP, Kwok, and TSC).
Intergenic regions have been virtually unstudied (113), and we note
that 75% of the SNPs we identified were intergenic (Table 17). The
SNP rate was highest in introns and lowest in exons. The SNP rate was
lower in intergenic regions than in introns, providing one of the
first discriminators between these two classes of DNA. These SNP
rates were confirmed in the Celera SNPs, which also exhibited a lower
rate in exons than in introns, and in extragenic regions than in
introns (46). Many of these intergenic SNPs will provide valuable
information in the form of markers for linkage and association
studies, and some fraction is likely to have a regulatory function as
well.

Fig. 14. SNP density in each 100-kbp interval as determined with
Celera-PFP SNPs. The color codes are as follows: black, Celera-PFP
SNP density; blue, coalescent model; and red, Poisson distribution.
The figure shows that the distribution of SNPs along the genome is
nonrandom and is not entirely accounted for by a coalescent model of
regional history.

7 An Overview of the Predicted
Protein-Coding Genes in the Human
Genome

Summary. This section provides an initial computational analysis of
the predicted protein set with the aim of cataloging prominent
differences and similarities when the human genome is compared with
other fully sequenced eukaryotic genomes. Over 40% of the predicted
protein set in humans cannot be ascribed a molecular function by
methods that assign proteins to known families. A protein
domain-based analysis provides a detailed catalog of the prominent
differences in the human genome when compared with the fly and worm
genomes. Prominent among these are domain expansions in proteins
involved in developmental regulation and in cellular processes such
as neuronal function, hemostasis, acquired immune response, and
cytoskeletal complexity. The final enumeration of protein families
and details of protein structure will rely on additional experimental
work and comprehensive manual curation.

A preliminary analysis of the predicted human protein-coding genes
was conducted. Two methods were used to analyze and classify the
molecular functions of 26,588 predicted proteins that represent
26,383 gene predictions with at least two lines of evidence as
described above. The first method was based on an analysis at the
level of protein families, with both the publicly available Pfam
database (114, 115) and Celera's Panther Classification (CPC) (Fig.
15) (116). The second method was based on an analysis at the level of
protein domains, with both the Pfam and SMART databases (115, 117).

The results presented here are preliminary and are subject to several
limitations. Both the gene predictions and functional assignments
have been made by using computational tools, although the statistical
models in Panther, Pfam, and SMART have been built, annotated, and
reviewed by expert biologists. In the set of computationally
predicted genes, we expect both false-positive predictions (some of
these may in fact be inactive pseudogenes) and false-negative
predictions (some human genes will not be computationally predicted).
We also expect errors in delimiting the boundaries of exons and
genes. Similarly, in the automatic functional assignments, we also
expect both false-positive and false-negative predictions. The
functional assignment protocol focuses on protein families that tend
to be found across several organisms, or on families of known human
genes. Therefore, we do not assign a function to many genes that are
not in large families, even if the function is known. Unless
otherwise specified, all enumeration of the genes in any given family
or functional category was taken from the set of 26,588 predicted
proteins, which were assigned functions by using statistical score
cutoffs defined for models in Panther, Pfam, and SMART.

For this initial examination of the predicted human protein set,
three broad questions were asked: (i) What are the likely molecular
functions of the predicted gene products, and how are these proteins
categorized with current classification methods? (ii) What are the
core functions that appear to be common across the animals? (iii) How
does the human protein complement differ from that of other sequenced
eukaryotes?

7.1 Molecular functions of predicted
human proteins

Figure 15 shows an overview of the putative molecular functions of
the predicted 26,588 human proteins that have at least two lines of
supporting evidence. About 41% (12,809) of the gene products could
not be classified from this initial analysis and are termed proteins
with unknown functions. Because our automatic classification methods
treat only relatively large protein families, there are a number of
"unclassified" sequences that do, in fact, have a known or predicted
function. For the 60% of the protein set that have automatic
functional predictions, the specific protein functions have been
placed into broad classes. We focus here on molecular function
(rather than higher order cellular processes) in order to classify as
many proteins as possible. These functional predictions are based on
similarity to sequences of known function.

In our analysis of the 12,731 additional low-confidence predicted
genes (those with only one piece of supporting evidence), only 636
(5%) of these additional putative genes were assigned molecular
functions by the automated methods. One-third of these 636 predicted
genes represented endogenous retroviral proteins, further suggesting
that the majority of these unknown-function genes are not real genes.
Given that most of these additional 12,095 genes appear to be unique
among the genomes sequenced to date, many may simply represent
false-positive gene predictions.

The most common molecular functions are the transcription factors and
those involved in nucleic acid metabolism (nucleic acid enzyme).
Other functions that are highly represented in the human genome are
the receptors, kinases, and hydrolases. Not surprisingly, most of the
hydrolases are proteases. There are also many proteins that are
members of proto-oncogene families, as well as families of "select
regulatory molecules": (i) proteins involved in specific steps of
signal transduction such as heterotrimeric GTP-binding proteins (G
proteins) and cell cycle regulators, and (ii) proteins that modulate
the activity of kinases, G proteins, and phosphatases.

Table 17. Distribution of SNPs in classes of genomic regions.

Genomic region class     Size of region      Celera-PFP SNP
                         examined (Mb)       density (SNP/Mb)
Intergenic               2185                707
Gene (intron + exon)     646                 917
Intron                   615                 921
First intron             164                 808
Exon                     31                  529
First exon               10                  592

Fig. 15. Distribution of the molecular functions of 26,383 human
genes. Each slice lists the numbers and percentages (in parentheses)
of human gene functions assigned to a given category of molecular
function. The outer circle shows the assignment to molecular function
categories in the Gene Ontology (GO) (179) and the inner circle shows
the assignment to Celera's Panther molecular function categories
(116).

7.2 Evolutionary conservation of core
processes

Because of the varioiS ."model organism** 
genome-sequencing projects that have al- 
ready been completed, reasonable compara- 
tive information is available for beginning the 
analysis of .the evolution of the human ge- 
nome, The genomies of S. cerevisiae C*bak-. 
crs* yeast") (IJ8) and two diverse inverte- 
brates, C. elegans (a nematode worm) {119) 
and D. melanogaster (fly) (2(5), as well as the 
first plant genome, A. thaliana, recently com- 
pleted {92), provide a diverse background for 
genome comparisons. 

We enumerated the "strict orthologs" conserved between human and fly, and
between human and worm (Fig. 16) to address the question, What are the core
functions that appear to be common across the animals? The concept of
orthology is important because if two genes are orthologs, they can be
traced by descent to the common ancestor of the two organisms (an
"evolutionarily conserved protein set"), and therefore are likely to perform
similar conserved functions in the different organisms. It is critical in
this analysis to separate orthologs (a gene that appears in two organisms by
descent from a common ancestor) from paralogs (a gene that appears in more
than one copy in a given organism by a duplication event) because paralogs
may subsequently diverge in function. Following the yeast-worm ortholog
comparison in (120), we identified two different cases for each pairwise
comparison (human-fly and human-worm). The first case was a pair of genes,
one from each organism, for which there was no other close homolog in either
organism. These are straightforwardly identified as orthologous, because
there are no additional members of the families that complicate separating
orthologs from paralogs. The second case is a family of genes with more than
one member in either or both of the organisms being compared. Chervitz et
al. (120) dealt with this case by analyzing a phylogenetic tree that
described the relationships between all of the sequences in both organisms,
and then looking for pairs of genes that were nearest neighbors in the tree.
If the nearest-neighbor pairs were from different organisms, those genes
were presumed to be orthologs. We note that these nearest neighbors can
often be confidently identified from pairwise sequence comparison without
having to examine a phylogenetic tree (see legend to Fig. 16). If the
nearest neighbors are not from different organisms, there has been a
paralogous expansion in one or both organisms after the speciation event
(and/or a gene loss by one organism). When this one-to-one correspondence is
lost, defining an ortholog becomes ambiguous. For our initial computational
overview of the predicted human protein set, we could not answer this
question for every predicted protein. Therefore, we consider only "strict
orthologs," i.e., the proteins with unambiguous one-to-one relationships
(Fig. 16). By these criteria, there are 2758 strict human-fly orthologs and
2031 strict human-worm orthologs (1523 in common between these sets). We
define the evolutionarily conserved set as those 1523 human proteins that
have strict orthologs in both D. melanogaster and C. elegans.
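The strict-ortholog criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used for the analysis: the `hits` dictionary, the function names, and the gene identifiers are hypothetical, and the P-value cutoff of 1e-10 is taken from the legend to Fig. 16. A pair is kept only if the two genes are bidirectional best hits and neither gene has a same-organism hit (a candidate paralog) that is as significant as the cross-organism pair.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical input: hits[(query, subject)] = BLASTP P-value.
# Smaller values are more significant. Both directions must be present.
Hits = Dict[Tuple[str, str], float]


def best_hit(query: str, candidates: Iterable[str], hits: Hits,
             max_p: float = 1e-10) -> Optional[str]:
    """Most significant hit of `query` among `candidates`, or None if no hit passes `max_p`."""
    scored = [(hits[(query, c)], c) for c in candidates
              if (query, c) in hits and hits[(query, c)] <= max_p]
    return min(scored)[1] if scored else None


def strict_orthologs(genes_a: Iterable[str], genes_b: Iterable[str],
                     hits: Hits, max_p: float = 1e-10) -> Set[Tuple[str, str]]:
    """Bidirectional best hits with no equally significant paralog in either organism."""
    genes_a, genes_b = list(genes_a), list(genes_b)
    pairs = set()
    for a in genes_a:
        b = best_hit(a, genes_b, hits, max_p)
        if b is None or best_hit(b, genes_a, hits, max_p) != a:
            continue  # not a bidirectional best hit
        p_pair = hits[(a, b)]
        # Reject the pair if any same-organism hit is at least as significant.
        paralog_as_good = (
            any(hits.get((a, x), 1.0) <= p_pair for x in genes_a if x != a)
            or any(hits.get((b, y), 1.0) <= p_pair for y in genes_b if y != b)
        )
        if not paralog_as_good:
            pairs.add((a, b))
    return pairs


def conserved_in_both(pairs_fly: Set[Tuple[str, str]],
                      pairs_worm: Set[Tuple[str, str]]) -> Set[str]:
    """Human genes that have a strict ortholog in both comparisons."""
    return {h for h, _ in pairs_fly} & {h for h, _ in pairs_worm}
```

Applying `strict_orthologs` once to the human-fly comparison and once to the human-worm comparison, then intersecting the human members with `conserved_in_both`, mirrors how the conserved set is defined as the overlap of the two ortholog sets.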

Fig. 16. Functions of putative orthologs across vertebrate and invertebrate
genomes. Each slice lists the number and percentages (in parentheses) of
"strict orthologs" between the human, fly, and worm genomes involved in a
given category of molecular function. "Strict orthologs" are defined here as
bi-directional BLAST best hits (180) such that each orthologous pair (i) has
a BLASTP P-value of ≤10^-10 (120), and (ii) has a more significant BLASTP
score than any paralogs in either organism, i.e., there has likely been no
duplication subsequent to speciation that might make the orthology
ambiguous. This measure is quite strict and is a lower bound on the number
of orthologs. By these criteria, there are 2758 strict human-fly orthologs,
and 2031 human-worm orthologs (1523 in common between these sets).

The distribution of the functions of the conserved protein set is shown in
Fig. 16. Comparison with Fig. 15 shows that, not surprisingly, the set of
conserved proteins is not distributed among molecular functions in the same
way as the whole human protein set. Compared with the whole human set
(Fig. 15), there are several categories that are overrepresented in the
conserved set by a factor of ~2 or more. The first category is nucleic acid
enzymes, primarily the transcriptional machinery (notably DNA/RNA
methyltransferases, DNA/RNA polymerases, helicases, DNA ligases, DNA- and
RNA-processing factors, nucleases, and ribosomal proteins). The basic
transcriptional and translational machinery is well known to have been
conserved over evolution, from bacteria through to the most complex
eukaryotes. Many ribonucleoproteins involved in RNA splicing also appear to
be conserved among the animals. Other enzyme types are also overrepresented
(transferases, oxidoreductases, ligases, lyases, and isomerases). Many of
these enzymes are involved in intermediary metabolism. The only exception is
the hydrolase category, which is not significantly overrepresented in the
shared protein set. Proteases form the largest part of this category, and
several large protease families have expanded in each of these three
organisms after their divergence. The category of select regulatory
molecules is also overrepresented in the conserved set. The major conserved
families are small guanosine triphosphatases (GTPases) (especially the
Ras-related superfamily, including ADP ribosylation factor) and cell cycle
regulators (particularly the cullin family, cyclin C family, and several
cell division protein kinases). The last two significantly overrepresented
categories are protein transport and trafficking, and chaperones. The most
conserved groups in these categories are proteins involved in coated
vesicle-mediated transport, and chaperones involved in protein folding and
heat-shock response [particularly the DNAJ family, and heat-shock protein 60
(HSP60), HSP70, and HSP90 families]. These observations provide only a
conservative estimate of the protein families in the context of specific
cellular processes that were likely derived from the last common ancestor of
the human, fly, and worm. As stated before, this analysis does not provide a
complete estimate of conservation across the three animal genomes, as
paralogous duplication makes the determination of true orthologs difficult
within the members of conserved protein families.
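The notion of a category being overrepresented "by a factor of about 2 or more" can be made concrete as a ratio of two fractions: the share of the conserved set that falls in a category, divided by the share of the whole protein set in that category. The sketch below assumes that reading. The function names and the idea of passing per-category counts as dictionaries are illustrative choices, not something described in the text, and no per-category counts from Figs. 15 and 16 are reproduced here.

```python
from typing import Dict


def fold_enrichment(subset_count: int, subset_total: int,
                    whole_count: int, whole_total: int) -> float:
    """Fraction of the subset in a category divided by the fraction of the whole set."""
    return (subset_count / subset_total) / (whole_count / whole_total)


def overrepresented(subset_counts: Dict[str, int],
                    whole_counts: Dict[str, int],
                    min_fold: float = 2.0) -> Dict[str, float]:
    """Categories whose share of the subset is at least `min_fold` times their share of the whole."""
    subset_total = sum(subset_counts.values())
    whole_total = sum(whole_counts.values())
    result = {}
    for category, n_subset in subset_counts.items():
        n_whole = whole_counts.get(category, 0)
        if n_whole == 0:
            continue  # category absent from the whole set; ratio undefined
        fold = fold_enrichment(n_subset, subset_total, n_whole, whole_total)
        if fold >= min_fold:
            result[category] = fold
    return result
```

With the conserved set as the subset (1523 proteins) and the full human set as the whole (26,383 genes), a category that holds the same share of both sets gives a ratio of 1, so only categories well above 1 would be reported.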



7.3 Differences between the human
genome and other sequenced
eukaryotic genomes

To explore the molecular building blocks of the vertebrate taxon, we have
compared the human genome with the other sequenced eukaryotic genomes at
three levels: molecular functions, protein families, and protein domains.

Molecular differences can be correlated with phenotypic differences to begin
to reveal the developmental and cellular processes that are unique to the
vertebrates. Tables 18 and 19 display a comparison among all sequenced
eukaryotic genomes, over selected protein/domain families (defined by
sequence similarity, e.g., the serine-threonine protein kinases) and
superfamilies (defined by shared molecular function, which may include
several sequence-related families, e.g., the cytokines). In these tables we
have focused on (super) families that are either very large or that differ
significantly in humans compared with the other sequenced eukaryotic
genomes. We have found that the most prominent human expansions are in
proteins involved in (i) acquired immune functions; (ii) neural development,
structure, and functions; (iii) intercellular and intracellular signaling
pathways in development and homeostasis; (iv) hemostasis; and (v) apoptosis.

Acquired immunity. One of the most striking differences between the human
genome and the Drosophila or C. elegans genome is the appearance of genes
involved in acquired immunity (Tables 18 and 19). This is expected, because
the acquired immune response is a defense system that only occurs in
vertebrates. We observe 22 class I and 22 class II major histocompatibility
complex (MHC) antigen genes and 114 other immunoglobulin genes in the human
genome. In addition, there are 59 genes in the cognate immunoglobulin
receptor family. At the domain level, this is exemplified by an expansion
and recruitment of the ancient immunoglobulin fold to constitute molecules
such as MHC, and of the integrin fold to form several of the cell adhesion
molecules that mediate interactions between immune effector cells and the
extracellular matrix. Vertebrate-specific proteins include the paracrine
immune regulators family of secreted 4-alpha helical bundle proteins, namely
the cytokines and chemokines. Some of the cytoplasmic signal transduction
components associated with cytokine receptor signal transduction are also
features that are poorly represented in the fly and worm. These include
protein domains found in the signal transducer and activator of
transcription (STATs), the suppressors of cytokine signaling (SOCS), and
protein inhibitors of activated STATs (PIAS). In contrast, many of the
animal-specific protein domains that play a role in innate immune response,
such as the Toll receptors, do not appear to be significantly expanded in
the human genome.
Neural development, structure, and function. In the human genome, as
compared with the worm and fly genomes, there is a marked increase in the
number of members of protein families that are involved in neural
development. Examples include neurotrophic factors such as ependymin, nerve
growth factor, and signaling molecules such as semaphorins, as well as the
number of proteins involved directly in neural structure and function such
as myelin proteins, voltage-gated ion channels, and synaptic proteins such
as synaptotagmin. These observations correlate well with the known
phenotypic differences between the nervous systems of these taxa, notably
(i) the increase in the number and connectivity of neurons; (ii) the
increase in number of distinct neural cell types (as many as a thousand or
more in human compared with a few hundred in fly and worm) (121); (iii) the
increased length of individual axons; and (iv) the significant increase in
glial cell number, especially the appearance of myelinating glial cells,
which are electrically inert supporting cells differentiated from the same
stem cells as neurons. A number of prominent protein expansions are involved
in the processes of neural development. Of the extracellular domains that
mediate cell adhesion, the connexin domain-containing proteins (122) exist
only in humans. These proteins, which are not present in the Drosophila or
C. elegans genomes, appear to provide the constitutive subunits of
intercellular channels and the structural basis for electrical coupling.
Pathway finding by axons and neuronal network formation is mediated through
a subset of ephrins and their cognate receptor tyrosine kinases that act as
positional labels to establish topographical projections (123). The probable
biological role for the semaphorins (22 in human compared with 6 in the fly
and 2 in the worm) and their receptors (neuropilins and plexins) is that of
axonal guidance molecules (124). Signaling molecules such as neurotrophic
factors and some cytokines have been shown to regulate neuronal cell
survival, proliferation, and axon guidance (125). Notch receptors and
ligands play important roles in glial cell fate determination and
gliogenesis (126).
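Family-size comparisons such as the semaphorin counts quoted above (22 in human, 6 in the fly, 2 in the worm) reduce to simple ratios. The following minimal sketch shows that arithmetic. The helper name `fold_expansion` and its handling of families that are absent from the other organism are assumptions made for illustration, not part of the analysis described in the text.

```python
from typing import Optional


def fold_expansion(human_count: int, other_count: int) -> Optional[float]:
    """Ratio of human family size to the family size in another organism.

    Returns infinity when the family is present in human but absent in the
    other organism, and None when it is absent from both.
    """
    if other_count == 0:
        return float("inf") if human_count > 0 else None
    return human_count / other_count


# Semaphorin counts as quoted in the text.
semaphorins = {"human": 22, "fly": 6, "worm": 2}
for species in ("fly", "worm"):
    ratio = fold_expansion(semaphorins["human"], semaphorins[species])
    print(f"semaphorins, human vs {species}: {ratio:.1f}-fold")
```

Treating a family that is absent in the comparison organism as an infinite expansion keeps vertebrate-specific families distinguishable from families that are merely larger in human.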

Other human expanded gene families play key roles directly in neural
structure and function. One example is synaptotagmin (expanded more than
twofold in humans relative to the invertebrates), originally found to
regulate synaptic transmission by serving as a Ca2+ sensor (or receptor)
during synaptic vesicle fusion and release (127). Of interest is the
increased co-occurrence in humans of PDZ and the SH3 domains in
neuronal-specific adaptor molecules; examples include proteins that likely
modulate channel activity at synaptic junctions (128). We also noted
expansions in several ion-channel families (Table 19), including the EAG
subfamily (related to cyclic nucleotide-gated channels), the voltage-gated
calcium/sodium channel family, the inward-rectifier potassium channel
family, and the voltage-gated potassium channel, alpha subunit family.
Voltage-gated sodium and potassium channels are involved in the generation
of action potentials in neurons. Together with voltage-gated calcium
channels, they also play a key role in coupling action potentials to
neurotransmitter release, in the development of neurites, and in short-term
memory. The recent observation of a calcium-regulated association between
sodium channels and synaptotagmin may have consequences for the
establishment and regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated glycoprotein are major classes of
protein components in both the central and peripheral nervous system of
vertebrates. Myelin P0 is a major component of peripheral myelin, and myelin
proteolipid and myelin oligodendrocyte glycoprotein are found in the central
nervous system. Mutations in any of these
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Accession 
number 



limitations of.Urge.sc^°"sutom"t^c^^^^^^^^ to the 



Donlatn name 



' • Domain description 



H 



W 



. Pf02O39 
Pf00212 
PF00028 
Pf00214 
PfOlllO 
PF01093 
* PF00029 
Pf00976 
PF00473 
Pf00007 
PF00778 
Pf00322 
PF00812 
PF01404 
Pf00167 
. PF01534 
Pf00236 
PF01153 
PF01271 
PF02058 
PF00049 
PF00219 
PF02024 
PF00193 
PF00243 
PF02158 
PF06l84 
■ PFO2070 
PF00066 
PF00865 
PF00159 
PF01279 
PF00123 
PF00341 
PF01403 
PF01033 
PFCX)103 
PF02208 
PF02404 
PF01034 
PF00020 
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PF01099 
Pf 01160 

PFoono 



AdrehonieduUm 
ANP 
Cadherin 
Calc^CGRPJAPP 
CNTF ' 
*• %-Clusterin ' 
Coniiexin 
ACTH^domaJn 
• CRF 
Cys^knot 
OIX 

Endothelin 
Ephrin • • 
EPh Ibd 
FCF 
Frizzled 
Hormones 
Clypican 
Cranin 
Cuanylin 
Insulin 
IGFBP 
Leptin 
Xlink 
NGF 

Neuregulin 
Hormones 
NMU 
Notch 

Osteopontin 
Hormone3 
Parathyroid 
Hormone2 
POGF 
Sema' 

Somatomedin^B- - 
Hormone 
Sorb 

SCF ' • 

Syndecan 
TNFfLce 
TGF-p 
UtcrogloWn 
Opiods^neuropep 

wnt : 



PF01821 
PF00386 
PF00200 
PF00754 
PF01410 
. PF00039 
PF00040 
PF00051 
PF01823 
PF00354 
PF0O277 
PF00084 
PF02210 
PF01108 
PF00868 
PF00927 



ANATO 
Clq 

Disintegrin 

fSJFSJypejC 

COLFI 

Fnl 

Fn2 • 

Kringle 

MACPF' ... 

Pentaxin 

SAA^rotelns 

Sushi 

TSPN 

TJssue.fac 

Transglutamln^N 

Trans^utamln.C 



' 'Developmental aridihmeostatic 

AdrenomeduUin 
Atrial natriuretic peptide 
Cadherin domain 
Calcitonin/CGRP/IAPP family 
Ciliary neurotrophic factor 
Clusterin 
Connexin 

Corticotropin ACTH domain * * 

Corticotropln-releasing factor family 
Cystine-knot domain 
Dix domain 
Endothelin family 
Ephrin 

Ephrin receptor ligand binding domain 
Fibroblast growth factor 
Frizzled/Smoothened family membrane region 
Glycoprotein hormones 
Clypican 

. ' Cralnin (chromogranin or secretogranin) 
Guanylin precursor 
InsuUn/IGF/Relaxin family 
Insulih'like growth factor blndinc proteins 
Leptin • • . . 

LINK (hyalUron binding) 
Nerve growth, factor family 
Neureguliri family ^ . 
. Neurohypophysial homiones 
Neuromedin U . • ■ . 

Notch (DSL) domain 
Osteopontin - ' ' 

Pancreatic hormone peptides 
Parathyroid hormone family 
Peptide hormone 

Platelet-derived growth factor (PDGF) 
Sema domain 

Somatomedin B domain . . 

Somatotropin 

Sorbin homologous domain 
Stem cell factor 

Syndecan domain ; - ' 

TNFR/NGFR cysteine-ricK jr^gfon ' \ . • 
* .Transforming growth factor p-like domain 
Uteroglobin family . * * 

Vertebrate endogenous opioids neuropeptide 
Wnt family of developmental signaling proteins 

- ' * ■ ^ . Hemostasts 

Anaphylotoxlri-like domain *. . 
> <IIlq domain " ^ -» 

Disintegriri "* . . * 

F5/8 type C domain • 

Fibrillar collagen Otermlnal domain • ' I '' ' 

Fibronect in type I domain * 

Fibrdnectin type II doniain • • 

Kringle domain ;^ . ' * 

MAC/Perforin domain 

Pentaxin family . • 

Serum amyloid A protein 
Sushi domain (SCR repeat) 
Thrombospondin N -terminal-like domains 
Tissue factor 
Transglutaminase family 
Transglutaminase family 



regulators 

1 
2 

100(550) 
3 
1 
3 

.. 14(16) 
1 
2 
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5 
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9 
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0 
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0 
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0 
0 
4 
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3 
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1 
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1 
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0 
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1 

0 
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2 
0 
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0 
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0 • 
0 
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0 
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0 
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0 
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0 
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H 



. W 



Pf00594 Cla 



I PF00711 
Pf 00748 
Pf00666 
.PF0pi29' 

' PF00993 
Pf00969 
PF00879 
Pf 01 109 
PF00047 
Pf00143 
PFO0714 
PF00726 
PF02372 
PF00715 
Pf0O727 
.PF02025 
PF0141S 
PF00340 
PF02394 
PF02059 
PF00489 
PF01291 

PF00323 
PF01091 
PFO6277 
PF0004S 

.PF01582 
PF00229 
PF00088 

Pf00779 
PF00168 
PF00609 
PF0O781 
PF00610 

PF01363 
PF00996 
PF00503 
PF00631 
PF00616 

PF00625 
PF02189 
PFfl!0169 
PF00130 



DefensJn^beta 
Calpaln_inhib 
•CatheUcidins ■ 
.MHCJ . 

MHCJLalpha** 
MHCJLbeta** 
Defensln^propep 
CM^CSF 

Interferon 
IFN-gamma 
IL10 
IL15 
IL2 
IW 
IL5 
IL7 
IL1 

IL1_propep 
113 
1L6 

LIF^OSM 

Defensins 
PTN.MK 
SAA»proteins 
IL8 

TIR 

JNF ' 
Trefoil 

BTK 
C2 

DAGKa 
DAGKc 
DEP 

FYV£ ' 
GDI 

G-atpha . 
G-gamma 
RasGAP 
RasGEFN 

Cuanylatckin 
ITAM f 
PH ^ ^ 
bAC.PE-bind ^ 



PF00388 W-PLOX" 



PF00387 

PF0d64O 
PF02192 
PF00794 
PF01412 
PF02196 
PF02145 

PFooraa 

?F00071 
^F0d617 
^F00615 
^F02197 



. Pl-PLC-Y 
PID 

PI3KLp85B 
PI3Jerbd • 
* ArfGAP . 
RBD 

Rap^GAP 

RA 

Ras 

RasGEF 

RGS 

Rlla 



: ^tamin K-dependent carboxylation/gamoia- 
carboxyglutamic (GLA) domain 

^ . . immune response 

Beta defensTn 
Calpain inhibitor repeat • 
. CatheUcidins **..-.- • • \. , . ^ * • 

Class I Mstocompatibility ahtfgea domains alpha 1 
■ and 2 * " • .• • • 

Class II histocompalibiiity antigen, alpha domain 
Class II histocompatibility antigen, beta domain 
Defensin propeptide 

Granulocyte-macrophage colony-stimulating factor 

Immunoglobulin domain 

Interferon alpha/beta domain 

Interferon gamma 

lnterleukin-10 

lnterleukin-15 

lnterieukin-2 

lntefleuIan-4 . 
InterleuWn-S 
Inlerleukin-7/9 family 
Inlerleukin-I » 
Interleukin-I propeptide 
lnterteukin-3 

lnterleukin-6/G-CSF/MGF family 

Leukemia Inhibitoiy factor (LIF)/oncostatin (OSM) 
*. family 

Mammalian defensin 

PTN/MK heparin-binding protein 

Serum am>4pid A protein 

Small cytokinefs (intecrine/chemokine), 

interleukin-S like 
TIR domain ^ . " . * 
TNF (tumof necrosis factor) family . 
Trefoil (P-type) domain . \ 

■ P/'PY-rho CTPase signaling 
BTK motif f - n 

CZ domain . 
Dracylglycerol kinase accessory domain (presumed) 
Diacylglycerol kinase catalytic domain (presurfiedj - 
Domain found in Dishevelled, EgHO, and 

Pleckstrin (DEP) 
FYVE zinc finger 
GDP dissociation Inhibitor 
C-protein alpha subunit . 
C-protein gamma like domains 
GTPase-activator protein for Ras-like GTPase 
Guanine nucleotide exchange factor for Kas-Iike 

GTPases; N-tenninal motif 
Cuanylate kinase - *' f^- 

fmmunpreceptor.tyrosine-based activation motif 
PH^main 

Phortol esters/diacylglycerol binding domain (G1 
■ domain) 

Phosghatidylinositol-specific phospholipase C X 
domain ' * . • 

Phosphatidylinositol-specific phospholipdse C Y 
domain 

PhosphotyrosTne Interaction domain (PTB/PID) 

PI3-kjnase family, p85-binding domain 

PI3-kjnase family, ras-blnding domain 

Putative GTP-ase activating protein for Arf 

Raf-Uke Ras-binding doniaTn 

Rap/ran-GAP / " 

Ras association (RatGDS/AF-6} domain 

Ras family 

RasGEF domain 

Regulator of G protein signaling domain 
Regulatoiy subunit of type ft PKA R-subuni t 



11 



•■.3(9) 
• ' .2 
'1?(20) 

. 5(6) 
7 
3 
1 

381 (930) 
7(9) 



2 
2 

2 
2 
4 
32 



0 
0 
0 

''.*•. : . 0 
0 

. 0 
0 
0 

125(291) 
0 
0 
O 
O 
0 
0 
0 
0 
0 
0 
O 
0 
0 

0 
0 
0 
0 



■0 



0 
0 

0- 
*;0 " 

0 
0 
0 
0 

67(323) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 



0 

. 0 
0 

• ^ 

0 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

6 
0 
0 
0 

0 
0 
0 
0 



24.(27) 
" 2 
6 
16 
6(7) 
5 

18(19) 
126 
2.1 
27 
4 



13 
1 
3 
9 
4 
4 

7(9) 
56(57) 
8 

6(7) 
1 



11(12) 
1 
1 
8 
1 
2 
6 
51 

12(13) 
2 



0 
0 
0 

"6 
0 
0 
1 

23 
5 
1 
1 



0 
0 

; 0 
. .0. 

d 

0 

0 

0 

0. 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 



• 18 
12 
5(6) 


8 
0 
0 


2 
0 
2-' 


b 
0 
0 


131 (143) 
0 
0 


5 

73 (101) ' 
9 
10 
12(13) 


1 

32(44) 
4 
8 
4 


0 

24(35) 
7 
8 
10 


0 

6(9) 
0 
2 
5 


0 

66 (90) 

11(12) 
2 


28(30) 
6 

27(30) 
16 
11 
9 


14 
2 
10 
•5 . 

5 

2 


15 
1 

20(23) 
5 
8 
3 


5 
1 
2 
1 
3 
5 


15 
3 
5 
0 
0 
0 


.12 
3 

193(212) 
45(56) . 


8 
0 

72(78) 
25(31) 


7 
0 

65 (68) 
26(40) 


1 
0 

. 24 
1(2) ^ 


4 
0 

.23 
4 


12 


3 




1 


8 


11 


' 2 


7 


1 


8 



0 
0 
0 
15 
0 
0 
0 
78 
0 
0 
0 
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Accession 
number 



Domain name*' 



Domain description 



H 



W 



PF00620 
PF00621 
PF00536 
PF01369 
PF00017 
PF00018 
PF01017 
PF00790 

PFobses 

PF00452 
■ PF02180 
PF00619 
PF00531 
PF01335 
PF02179 
PF00656 
PF00653 

PF00022 
PF00191 
PF00402 
PF00373 

PF00880 
. PF00681 

PF00435 

PF00418 

PF00992 

PF02209 

PF01044 

PF01391 
• PF01413 • 

PF00431 
PF00008 
PF00147 



PF00041 
PF00757 
PF003S7 
PF00362 
PF00052 
PF00053 
PF00054 
PF00055 
PF00059 
PF01463 
^ PF01462 
PF00057 
PF00058 
PF00530 
. PF00084 
PF00090 ■ 
. PFO0092 
PF00093 
PF00094 

PFO6244 . 

PF00O23 

PF00514 

PF00168 

PF00027 

PF015S6 

PF00226 

PF00036 

PF00611 

PF01846 

PF00498 



RhoGAP 
RhoCEF 
SAM : 
Sec7 
SH2 
SH3 
STAT 
VHS 
WHl 

Bc(.2 
- BH4 
. CARD 

Death 

DED 

BAG 

ICE.P20 

BIR 

Actin 

Annexin 

Calponin 

Band_41 

NebuUn_repeat- 

Plectin^repeat 

Spectrin 

TubuUn-binding 

Troponin 

VHP * 

Vinculin 

Collagen 
C4 

CUB 
EGF 

Flbrinogen_C 



Fn3 

Furin-like 

Integrln^ 

Integrin.B 

Laminin^B 

Laminin_ECF 

Laminin^G 

Laminia.Nt€rm 

Lectin^c 
. IRRCr 
- tRRNT, 
LdLrecept^g 
Ld^^recep'Cb 
SRCR.- . 
Sushi 

tsp^l 

Vwa 

Vwc 

y wd . 

14-3-3 
Artk 

Armadillo seg - 
C2 

cNMP.blnding 

DnaJ^C 

DnaJ 

Efhand** 

FCH 

FF 

FHA 



RhoCAP domain 

RhoCEF domain • ■ 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2J domain * 
Src homology 3 (SH3) domain 
STAT protein - ■' ' 
: VHS domain 
WHl domain 

Domains tnvolved in apoptosts 

BcI-2 homology region 4 ' 
. Caspase recruitment domain 

Death domain 

Death effector domain 
. Domain present In Hsp70 regulators 

ICE-like protease (caspase) p20 domain 

Inhibitor of Apoptosis domain 

. Cytoskeietal 

Actin 
Annexin 
Calponin family 

FERM domain (Band 4.1 family) 
• Nebulin repeat 
Plectin repeat 
Spectrin repeat 

Tau and MAP proteins, tubulin-binding 
Troponin 

Villin headpiece domain 
Vinculin family . . 

... BCM adhesion 

Collagen triple helix repeat (20 copies) . ' • 
C-terminal tandem repeated domain In type 4 
procollagen . . ' " . 

CUB domain 
EGF-like domain 

Fibrinogen beta and gamma chains, C-terminal 

globular domain 
Fibronectin type II! domain 
Furin-Uke cysteine rich region 
Integrin alpha cytoplasmic region 
Integrins. beta chain • • 

Laminin B (Domain IV) • • • 
Lamlnin EGF-like (Domains lll and V) 
Lamlnln G domain 
laminin N-termlnal (Domain VI)' ' ' 
lectin C-type domain ^ , / ^ ? .■ • 
; / leucine rich repeat C-terrnlnal domain 
.-- leucine rich repeat N-terminal domam * ' 
low-density lipoprotein receptor domain class A 
low-density lipoprotein receptor repeat class B 
Scavenger receptor cysteine-rich domain 
"Sushi domain (SCR repeat) 
. Thfombospondin type 1 domam^ ^ 
von Willebrand factor type A domain . . • 

von Willebrand factor type C domain 
von VVillebrand factor .type D domain • • •-. *- 

Protein interaction domains 
14-3-3 proteins . 
Ank repeat 

Armadillo/beta-catenln-like repeats 
C2 domain . .. i • • 

Cyclic nucleotlde-blpding domain - 
DnaJ C terminal region 
DnaJ domain 

EFhand *" ^ ' ^■ 

Fes/CIP4 homology domain 
FF domain " ' 

FHA domain 



59 
46 
29(31) 
13 

. 87(95) 
143(182) 
7 
4 
7 

■ 9 

• ; • • . 3 

• 16 
16 
4(5) 
5(8) 
11 
8(14) 

61(64) 
16(55) 
13(22) 
29(30) 
4(148) 
2(11) 
31(195) 
4(12) 
4 
5 
4 



19 
23(24) 
15 
5 

33(39) 
55(75) 
1 
2 

2 * 

2 
0 
0 
5 
0 

3 . 
5(9) 

15(16) 
4(16) 

17(19) 

0 

13(171) 
• 1(4) 
6 

2 • 
. 2 



65(279)- 
6(11) 

-47(69) 
108 (420) 
26 

106(545) 
5- 
3 
8 

8(12) 
24(126) 
30(57) 
10 
47(76) 
69(81) 
40(44) 
35(127) 
15(96) 
.11(46) 
53(191) 
41 (66) . 
34(58) 
. 19(28) 
15(35) 



■ 20 
145(404) 
22(56) 
73(101) 
26(31) 
. 12 
44 

83 (151) 
9 

4(11) 
13 



10(46) 
2(4) 

9(47) 
45(186) 
10(11) 

42 (158) 
2 
1 

• 4(7) 

18(42) 
6 

23(24) 
23(30) 
7(13) 
33 (152) 
9(56) 
4(8) 
11(42) 
. ,11(23) 
0 . 
6(11) 
.3(7) 

3 

72(269) 
11(38) ^ 
32(44) 
21(33) 
9 
34 

64(117) 
3 

4(10) • 
IS 



20 
18(19) 
8 

5 

44 (48) 
46(61) 

i(2) 

4 

2(3) 

1 
1 

•2. 
7 
0 
2 
3 

2(3) 

12 
4(11) 
7(19) 
11(14) 
1 
0 

10(93) 
2(8) 
8 
2 
1 



9 
3 
3 
5 
1 

23(27) 
O 
4 
1 

O 
. 0 . 
0 * 
0 
0 

1 

0 

1(2) 

-9fn) 
0 
0 
0 
0 
0 
0 

0 

0 
0 
0 



8 
0 
• 6 
9 
3 
4 

0.. 

8 

0 

0 
.0 
0 
0 
O 
5 
0 
0 



24 
6(16) 
0 
0 
0 
0 
0 
0 
0 
5 
0 



174<384) 


. 0 


*. 0 


3(6) 


0 


0 


43(6?) 


0 


0 


54(157) 


0 


1 


6 


0 


0 


34(156) 


0 


1 


1 


0 


0 


2 


0 


0 


2 


0 


0 


6(10) 


0 


0 


11(65) 


0 


0 


14(26) 


0 


0 


4 


0 


0 


91 (132) 


0 


0 


7(9) 


0 


0 


3(6) 


0 


0 


27(113) 


0 


0 


7(22) 


0 


0 


1(2) 


0 


0 


8(45) 


0 


0 


18(47) 


0 


0 


17(19) 


0 


1 


2(5) 


0 


0 




* 0 


0 




' 2 


15 


75(223) 


12(20) 


66(111) 


3(11) 


2(10) 


25(67) 


24(35) 


6(9) 


66(90) 


15(20) 


2(3) 


r 22 


5 


3 


19 


33 


20 


93 


41(86) 


4(11) 


120 (328) 




4 . 


0 


3(16) 


13(14) 


4(8) 


7 


17 



-40

myelin proteins result in severe demyelination, which is a pathological
condition in which the myelin is lost and the nerve conduction is severely
impaired (130). Humans have at least 10 genes belonging to four different
families involved in myelin production (five myelin P0, three myelin
proteolipid, myelin basic protein, and myelin-oligodendrocyte glycoprotein,
or MOG), and possibly more remotely related members of the MOG family. Flies
have only a single myelin proteolipid, and worms have none at all.

Intercellular and intracellular signaling pathways in development and
homeostasis. Many protein families that have expanded in humans relative to
the invertebrates are involved in signaling processes, particularly in
response to development and differentiation

Table 18 (Continued)



■ Accession 
• nomber. 


** Domain name 




FKBP 


Pf 01590 


GAF ' . 


Pr0l344 


Kelcn 


PF0O560 




Pf00917 


MATH 


Pf0d989 


PAS ♦ 


PF00595 


PDZ 


Pf00169 


PH 


Pf01535 


PPR** 


PF00536 


SAM 


Pf01369 


Sec7 


pfoooir 


SH2 


PF00018 


SH3 


PF01740 


STAS 


PF00515 


TPR** 


PF00400 


WD40** 


PF00397 


WW 


PF00569 


2Z 


PF01754 


2f-A20 


PF01388 


ARtO 


PF01426 


BAH 


PF00643 




PF00533 . 


BRCr . 


PF0d439 


Bromodomain 


PF00651 


BTB 


PF00145 


DNA^methylase 


PF0O385 


Chromo 


PF00125 


Htstone 


PF0O134 


CycUn 


PF0O270 


DEAD 



Domain description . * • . 



H 



Y 



\ " A • 



»F00412 
^F00917 
^F00249 
'F02344 
'F01753 
»F00628 
'F0O157 
'F02257 
'F00076 



2f-DHHC 

F-box*» 

ForK_head 

GATA 

G-patch 

HLH** 

Histjdeacetyl ^ - 
Homeobox 
TIG 
JmjC 
JmjN' ' 
KH-ddmaia - 
• KRAB 
Honnonejsreo * . 

UM 
MATH 

Myb.DNA-binding 

Myc-U 

2f-MYND 

PHD 

Pou 

RFK.DNA.binding 
Rmi ' 

SAP 
SPRY 
START 
T-box 



FKBP-type peptidyl-prblyl cis-trans Isomerases 
GAF domain . 
Kelch motif ' • ' 

Leucine Rich Repeat 
MATH domain 
PAS domain 

PDZ domain (Also Known as DHR or GLGF) 
PH domain 
PPR repeat 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAS domain 
TPR domain 
WD40 domain 
WW domain 

ZZ-ZInc finger present In dystrophin, CBP/p300 

Nuclear Interaction domains 

*A20-Uke zinc finger 
ARID DNA binding domain 
BAH domain ,. 
B-box zinc finger 

BRCA1 C Tehniniis (BRCT) domain 
Bromodomain • * 

BTB/POZ domain " * " 

C-5 cytoslne-specific DNA methylase 
chromo* (CHRromatin Organization Modifier) 

domain ' • : • » * 

Core histone H2A/H2B/H3/H4 
CycUn 

DEAD/DEAH box heUcase 
DHHC zinc finger domain 
F-box domain 
Fork head domain 
GATA zinc finger 
C-patch domain 

Helix-loop-helix DNA-binding domain 
Histone deacetylase family . 
Homeobox domain 

IPT/TIG domain " ' 

JmjC domain - • 
Jnfijri' domain 
KH ddmarn .. 
KRAB box 

Ugand-binding dorn'aln of nuclear hormone 

receptor 
LIM'Somalh containing proteins 
•MATH domain ^ 
M|yb-Uke DNA-blndIng domain 
Myc leucine zipper domain 
MYND finger 
PHD-finger 

Pou domalrv— N-termtnal to homeobox domain 
RFX DMA-binding domalp . ' 
RNA recognition mo.U7 (a.k.a. RRM, RBD, or RNP 

domain) .• * 

SAP domain 
SPRY domain 
START domain 
T-box 



15(20) 

7(8) 
54 (157) 
25(30) 
11 

18(19) 
96 (154) 
193 (212) 
5 

29(31) 
13 

87(95) 
143(182) 
5 

72(131) 
136 (305) 
32(53) 
10(11) 

2(8) 
11 
8(10) 
32(35) 
17(28) 
37(48) 
97(98) 
3(4) 
24(27) 



. 7(8) 


7(13) 


4 


24(29) 


.2(4) 


1 


0 


10 


12(48) 


13(41) 


3 


102(178) 


24(30) 


7(11) 


1 


15(16) 


5 


88 (161) 


1 


61 (74) 


9(10) 


6 


1 


- 13(18) 


60 (87) 


46(66) 


2 


5 


7Z{7^) 


65 (68) 


24 


23 


3(4) 


0 


1 


474(2485) 


15 


8 


3 


6 


5 


5 


5 


9 


33 (39) 


44(48) 


1 


3 


55(75) 


46(61) 


23(27) 


4 


1 


6 


2 


13 


39(101) 


28(54) 


16(31) 


65(124) 


98(226) 


72(153) 


56(121) 


167(344) 


24(39) 


16(24) 


5(8) 


11(15) 


13 


10 


2 


. 10 



2 
6 

7(8) 
1 

10(18) 
16(22) 
62 (64) 
1 

14(15) 



2 
4 

. 4(5) 
2 

.23(35) 
18 (26) 
86 (91) 

17(18) 



0 
2 
5 
0 

10(16) 
10(15) 

1 (2) 
0 

; 1(2) 



8 

21(25) 
0 

12(16) 
28 
30(31) 
13(15) 
12 



75(81).. 


5 


71 (73) 


8 


48 


19 


10 


10 


11 


35 


63 (66) 


48(50) 


55(57) 


50(52) 


84(87) 


15 


20 


16 


7 


22 


16 


15 


309 (324) 


9 


165(167) 


35(36) 


20(21) 


15 


4 


0 


11(17) 


5(6j 


8(10) 


9 


26 


18 


16 


13 


4 


14(15) 


60(61) 


44- 


24 


4 


39 


12 


5(6) 


: 8(10) 


5 


10 


160 (178) 


100(103) 


" 82(84) 


6 


66 


29(53) 


11(13) 


5(7) 


2 


1 


10 


4 


6 


4 


7 


7 


4 


2 


3 


7 


• 28(67) 


14(32) 


17(46) 


4(14) 


27(61) 


204(243) 


0 


0 


0 


0 


47 


17 


142(147) 


0 , 


■ 0 



62(129) 


33(83) 


33(79) 


4(7) 


10(16) 


n 


. .5 


88 (161) 


1 


61(74) 


•32(43) 


18(24) 


17(24) 


15(20) 


243 (401) 


. 1 


. 0 


0 


0 


0 


. ' - * - 14 


14 


9 


1 


7 


68(86) 


40(53) 


32(44) 


14(15) 


96 (105) 


15 


5 


4 


0 


0 


7 


2 


1 


1 


0 


224(324) 


127(199) 


54(145) 


43(73) 


232(369) 


15 


8 




5 


.6(7) 


44(51) 


10(12) 


5(7) 


3 


6 


to' 


2 


6 


0 


23 


17(19) 


8 


22 


0 


0 
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Accession 
number      Domain name   Domain description

PF02135     zf-TAZ        TAZ finger
PF01285     TEA           TEA domain
PF02176     zf-TRAF       TRAF-type zinc finger
PF00352     TBP           Transcription factor TFIID (or TATA-binding protein, TBP)
PF00567     TUDOR         TUDOR domain
PF00642     zf-CCCH       Zinc finger, C-x8-C-x5-C-x3-H type (and similar)
PF00096     zf-C2H2       Zinc finger, C2H2 type
PF00097     zf-C3HC4
PF00098     zf-CCHC

(Tables 18 and 19). They include secreted hormones and growth factors,
receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome
include growth factors such as wnt, transforming growth factor-β (TGF-β),
fibroblast growth factor (FGF), nerve growth factor, platelet derived growth
factor (PDGF), and ephrins. These growth factors affect tissue
differentiation and a wide range of cellular processes involving
actin-cytoskeletal and nuclear regulation. The corresponding receptors of
these developmental ligands are also expanded in humans. For example, our
analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in the worm)
and 12 ephrin receptors (2 in the fly, 1 in the worm). In the wnt signaling
pathway, we find 18 wnt family genes (6 in the fly, 5 in the worm) and 12
frizzled receptors (6 in the fly, 5 in the worm). The Groucho family of
transcriptional corepressors downstream in the wnt pathway are even more
markedly expanded, with 13 predicted members in humans (2 in the fly, 1 in
the worm).
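
The counts in the paragraph above can be restated as fold expansions. The following is a minimal sketch that only divides the numbers quoted there; the dictionary layout and the function names are illustrative assumptions, not part of the original analysis, and the ephrin figure is a lower bound ("at least 8").

```python
# Gene counts quoted in the paragraph above, as (human, fly, worm).
# The human ephrin count is given in the text as "at least 8".
counts = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def fold_expansion(human, other):
    """Ratio of the human gene count to the count in another organism."""
    return human / other


def expansion_table(counts):
    """Map each family to (fold vs. fly, fold vs. worm)."""
    return {
        family: (fold_expansion(h, fly), fold_expansion(h, worm))
        for family, (h, fly, worm) in counts.items()
    }


table = expansion_table(counts)
for family, (vs_fly, vs_worm) in table.items():
    print(f"{family:16s} {vs_fly:5.1f}x vs fly  {vs_worm:5.1f}x vs worm")
```

On these numbers the Groucho family shows the largest ratio against both invertebrates, which matches the text's remark that it is "even more markedly expanded".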

Extracellular adhesion molecules involved in signaling are expanded in the
human genome (Tables 18 and 19). The interactions of several of these
adhesion domains with extracellular matrix proteoglycans play a critical role
in host defense, morphogenesis, and tissue repair (131). Consistent with the
well-defined role of heparan sulfate proteoglycans in modulating these
interactions (132), we observe an expansion of the heparin sulfate
sulfotransferases in the human genome relative to worm and fly. These
sulfotransferases modulate tissue differentiation (133). A similar expansion
in humans is noted in structural proteins that constitute the
actin-cytoskeletal architecture. Compared with the fly and worm, we observe
an explosive expansion of the nebulin (35 domains per protein on average),
aggrecan (12 domains per protein on average), and plectin (5 domains per
protein on average) repeats in humans. These repeats are present in proteins
involved in modulating the actin-cytoskeleton with predominant expression in
neuronal, muscle, and vascular tissues.



Comparison across the five sequenced eukaryotic organisms revealed several
expanded protein families and domains involved in cytoplasmic signal
transduction (Table 18). In particular, signal transduction pathways playing
roles in developmental regulation and acquired immunity were substantially
enriched. There is a factor of 2 or greater expansion in humans in the Ras
superfamily GTPases and the GTPase activator and GTP exchange factors
associated with them. Although there are about the same number of tyrosine
kinases in the human and C. elegans genomes, in humans there is an increase
in the SH2, PTB, and ITAM domains involved in phosphotyrosine signal
transduction. Further, there is a twofold expansion of phosphodiesterases in
the human genome compared with either the worm or fly genomes.
The downstream effectors of the intracellular signaling molecules include the
transcription factors that transduce developmental fates. Significant
expansions are noted in the ligand-binding nuclear hormone receptor class of
transcription factors compared with the fly genome, although not to the
extent observed in the worm (Tables 18 and 19). Perhaps the most striking
expansion in humans is in the C2H2 zinc finger transcription factors. Pfam
detects a total of 4500 C2H2 zinc finger domains in 564 human proteins,
compared with 771 in 234 fly proteins. This means that there has been a
dramatic expansion not only in the number of C2H2 transcription factors, but
also in the number of these DNA-binding motifs per transcription factor (8 on
average in humans, 3.3 on average in the fly, and 2.3 on average in the
worm). Furthermore, many of these transcription factors contain either the
KRAB or SCAN domains, which are not found in the fly or worm genomes. These
domains are involved in the oligomerization of transcription factors and
increase the combinatorial partnering of these factors. In general, most of
the transcription factor domains are shared between the three animal genomes,
but the reassortment of these domains results in organism-specific
transcription factor families. The domain combinations found in the human,
fly, and worm include the BTB with C2H2 in the fly and humans, and
homeodomains alone or in combination with Pou and LIM domains in all of the
animal genomes. In plants, however, a different set of transcription factors
are expanded, namely, the myb family, and a unique set that includes VP1 and
AP2 domain-containing proteins (134). The yeast genome has a paucity of
transcription factors compared with the multicellular eukaryotes, and its
repertoire is limited to the expansion of the yeast-specific C6 transcription
factor family involved in metabolic regulation.
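
The per-protein averages quoted for the C2H2 zinc fingers follow directly from the Pfam totals in the same paragraph. The sketch below, with illustrative variable and function names that are assumptions of this example, reproduces that arithmetic for human and fly; the text gives only the average (2.3) for the worm, not the underlying totals, so the worm is left out.

```python
# Pfam totals quoted in the text: (C2H2 domains, proteins containing them).
pfam_c2h2 = {
    "human": (4500, 564),
    "fly": (771, 234),
}


def domains_per_protein(domains, proteins):
    """Average number of C2H2 domains per C2H2-containing protein."""
    return domains / proteins


averages = {
    organism: domains_per_protein(domains, proteins)
    for organism, (domains, proteins) in pfam_c2h2.items()
}

for organism, average in averages.items():
    print(f"{organism}: {average:.1f} C2H2 domains per protein")

# Human-to-fly ratios for the two quantities the text distinguishes.
protein_fold = pfam_c2h2["human"][1] / pfam_c2h2["fly"][1]
domain_fold = pfam_c2h2["human"][0] / pfam_c2h2["fly"][0]
print(f"proteins: {protein_fold:.1f}x, domains: {domain_fold:.1f}x")
```

Because the domain total grows faster than the protein total, the average per protein rises as well, which is the point the paragraph makes about expansion in both the number of factors and the number of motifs per factor.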

While we have illustrated expansions in a subset of signal transduction
molecules in the human genome compared with the other eukaryotic genomes, it
should be noted that most of the protein domains are highly conserved. An
interesting observation is that worms and humans have approximately the same
number of both tyrosine kinases and serine/threonine kinases (Table 19). It
is important to note, however, that these are merely counts of the catalytic
domain; the proteins that contain these domains also display a wide
repertoire of interaction domains with significant combinatorial diversity.

Hemostasis. Hemostasis is regulated primarily by plasma proteases of the
coagulation pathway and by the interactions that occur between the vascular
endothelium and platelets. Consistent with known anatomical and physiological
differences between vertebrates and invertebrates, extracellular adhesion
domains that constitute proteins integral to hemostasis are expanded in the
human relative to the fly and worm (Tables 18 and 19). We note the evolution
of domains such as FIMAC, FN1, FN2, and C1q that mediate surface interactions
between hematopoietic cells and the vascular matrix. In addition, there has
been extensive recruitment of more-ancient animal-specific domains such as
VWA, VWC, VWD, kringle, and FN3 into multidomain proteins that are involved
in hemostatic regulation. Although we do not find a large expansion in the
total number of serine proteases, this enzymatic domain has been specifically
recruited into several of these multidomain proteins for proteolytic
regulation in the vascular compartment. These are represented in plasma
proteins that belong to the kinin and complement pathways. There is a
significant expansion in two families of matrix metalloproteases: ADAM (a
disintegrin and metalloprotease) and MMPs (matrix metalloprotease) (Table
19). Proteolysis of extracellular matrix (ECM) proteins is critical for
tissue development and for tissue degradation in diseases such as cancer,
arthritis, Alzheimer's disease, [illegible] conditions (135, 136). ADAMs are
a family of integral membrane proteins with a pivotal role in
fibrinogenolysis and modulating interactions between hematopoietic components
and the vascular matrix components. These proteins have been shown to cleave
matrix proteins and even signaling molecules: ADAM-17 [illegible] factor-α,
and ADAM-10 has been implicated in the Notch signaling pathway (135). We have
identified 19 members of the matrix metalloprotease [illegible] of the ADAM
and ADAM-TS families.

Apoptosis. Evolutionary conservation of some of the apoptotic pathway
components across eukarya is consistent with its central role in
developmental regulation and as a response to pathogens and stress signals.
The signal transduction pathways involved in programmed cell death, or
apoptosis, are mediated by interactions between well-characterized domains
that include extracellular domains, adaptor (protein-protein interaction)
domains, and those found in effector and regulatory enzymes (137). We
enumerated the protein counts of central adaptor and effector enzyme domains
that are found only in the apoptotic pathways to provide an estimate of
divergence across eukarya and relative expansion in the human genome when
compared with the fly and worm (Table 18). Adaptor domains found in proteins
restricted only to apoptotic regulation such as the DED domains are
vertebrate-specific, whereas others like BIR, CARD, and Bcl2 are represented
in the fly and worm (although the number of Bcl2 family members in humans is
significantly expanded). Although plants and yeast lack the caspases,
caspase-like molecules, namely the para- and meta-caspases, have been
reported in these organisms (138). Compared with other animal genomes, the
human genome shows an expansion in the adaptor and effector domain-containing
proteins involved in apoptosis, as well as in the proteases involved in the
cascade such as the caspase and calpain families.

Expansions of other protein families. Metabolic enzymes. There are fewer
cytochrome P450 genes in humans than in either the fly or worm. Lipoxygenases
(six in humans), on the other hand, appear to be specific to the vertebrates
and plants, whereas the lipoxygenase-activating proteins (four in humans) may
be vertebrate-specific. Lipoxygenases are involved in arachidonic acid
metabolism, and they and their activators have been implicated in diverse
human pathology ranging from allergic responses to cancers. One of the most
surprising human expansions, however, is in the number of
glyceraldehyde-3-phosphate dehydrogenase (GAPDH) genes (46 in humans, 3 in
the fly, and 4 in the worm). There is, however, evidence for many
retrotransposed GAPDH pseudogenes (139), which may account for this apparent
expansion. However, it is interesting that GAPDH, long known as a conserved
enzyme involved in basic metabolism found across all phyla from bacteria to
humans, has recently been shown to have other functions. It has a second
catalytic activity, as a uracil DNA glycosylase (140) and functions as a cell
cycle regulator (141) and has even been implicated in apoptosis (142).

Translation. Another striking set of human expansions has occurred in certain
families involved in the translational machinery. We identified 28 different
ribosomal subunits that each have at least 10 copies in the genome; on
average, for all ribosomal proteins there is about an 8- to 10-fold expansion
in the number of genes relative to either the worm or fly. Retrotransposed
pseudogenes may account for many of these expansions [see the discussion
above and (143)]. Recent evidence suggests that a number of ribosomal
proteins have secondary functions independent of their involvement in protein
biosynthesis; for example, L13a and the related L7 subunits (36 copies in
humans) have been shown to induce apoptosis (144).

Table 19 (Continued)

The Human genome

There is also a four- to fivefold expansion in the elongation factor 1-alpha
family (eEF1A; 56 human genes). Many of these expansions likely represent
intronless paralogs that have presumably arisen from retrotransposition, and
again there is evidence that many of these may be pseudogenes (145). However,
a second form (eEF1A2) of this factor has been identified with
tissue-specific expression in skeletal muscle and a complementary expression
pattern to the ubiquitously expressed eEF1A (146).

Ribonucleoproteins. Alternative splicing results in multiple transcripts from
a single gene, and can therefore generate additional diversity in an
organism's protein complement. We have identified 269 genes for
ribonucleoproteins. This represents over 2.5 times the number of
ribonucleoprotein genes in the worm, two times that of the fly, and about the
same as the 265 identified in the Arabidopsis genome. Whether the diversity
of ribonucleoprotein genes in humans contributes to gene regulation at either
the splicing or translational level is unknown.

Posttranslational modifications. In this set of processes, the most prominent
expansion is the transglutaminases, calcium-dependent enzymes that catalyze
the cross-linking of proteins in cellular processes such as hemostasis and
apoptosis (147). The vitamin K-dependent gamma carboxylase gene product acts
on the GLA domain (missing in the fly and worm) found in coagulation factors,
osteocalcin, and matrix GLA protein (148). Tyrosylprotein sulfotransferases
participate in the posttranslational modification of proteins involved in
inflammation and hemostasis, including coagulation factors and chemokine
receptors (149). Although there is no significant numerical increase in the
counts for domains involved in nuclear protein modification, there are a
number of domain arrangements in the predicted human proteins that are not
found in the other currently sequenced genomes. These include the tandem
association of two histone deacetylase domains in HD6 with a ubiquitin finger
domain, a feature lacking in the fly genome. An additional example is the
co-occurrence of important nuclear regulatory enzyme PARP (poly-ADP ribosyl
transferase) domain fused to protein-interaction domains (BRCT and VWA) in
humans.

Concluding remarks. There are several possible explanations for the
differences in phenotypic complexity observed in humans when compared to the
fly and worm. Some of these relate to the prominent differences in the immune
system, hemostasis, neuronal, vascular, and cytoskeletal complexity. The
finding that the human genome contains fewer genes than previously predicted
might be compensated for by combinatorial diversity generated at the levels
of protein architecture, transcriptional and translational control,
posttranslational modification of proteins, or posttranscriptional
regulation. Extensive domain shuffling to increase or alter combinatorial
diversity can provide an exponential increase in the ability to mediate
protein-protein interactions without dramatically increasing the absolute
size of the protein complement (150). Evolution of apparently new (from the
perspective of sequence analysis) protein domains and increasing regulatory
complexity by domain accretion both quantitatively and qualitatively
(recruitment of novel domains with preexisting ones) are two features that we
observe in humans. Perhaps the best illustration of this trend is the C2H2
zinc finger-containing transcription factors, where we see expansion in the
number of domains per protein, together with vertebrate-specific domains such
as KRAB and SCAN. Recent reports on the prominent use of internal ribosomal
entry sites in the human [illegible] translation of specific classes of
proteins suggests that this is an area [illegible] identify the full
[illegible] genome (151). At the posttranslational level, although
[illegible] of expansions of some protein families involved in these
modifications, further experimental evidence is required to evaluate whether
this is correlated with increased complexity in protein processing.
Posttranscriptional processing and the extent of isoform generation in the
human remain to be cataloged in their entirety. Given the conserved nature of
the spliceosomal machinery, further analysis will be required to dissect
regulation at this level.

8 Conclusions 



8.1 The whole-genome sequencing
approach versus BAC by BAC

Experience in applying the whole-genome shotgun sequencing approach to a
diverse group of organisms with a wide range of genome sizes and repeat
content allows us to assess its strengths and weaknesses. With the success of
the method for a large number of microbial genomes, Drosophila, and now the
[illegible] concerning the utility of this method. The large number of
microbial genomes that have been sequenced by this method (15, 80, 152)
demonstrate that megabase-sized genomes can be sequenced efficiently without
any input other than the de novo mate-paired sequences. With more complex
genomes like those of Drosophila or human, map information, in the form of
well-ordered markers, has been critical for long-range ordering of scaffolds.
For joining scaffolds into chromosomes, the quality of the map (in terms of
the order of the markers) is more important than the number of markers per
se. Although this mapping could have been performed concurrently with
sequencing, [illegible] of mapping data was beneficial. During the sequencing
of the A. thaliana genome, sequencing of individual BAC clones permitted
extension of the sequence well into centromeric regions and allowed
high-quality resolution of complex repeat regions. Likewise, in Drosophila,
the BAC physical map was most useful in regions near the highly repetitive
centromeres and telomeres. WGA has been found to deliver excellent-quality
reconstructions of the unique regions of the genome. As the genome size, and
more importantly the repetitive content, increases, the WGA approach delivers
less of the repetitive sequence.

The cost and overall efficiency of clone-by-clone approaches makes them
difficult to justify as a stand-alone strategy for future large-scale
genome-sequencing projects. Specific applications of BAC-based or other clone
mapping and sequencing strategies to resolve ambiguities in sequence assembly
that cannot be efficiently resolved with computational approaches alone are
clearly worth exploring. Hybrid approaches to whole-genome sequencing will
only work if there is sufficient coverage in both the whole-genome shotgun
phase and the BAC clone sequencing phase. Our experience with human genome
assembly suggests that this will require at least 3× coverage of both
whole-genome and BAC shotgun sequence data.

8.2 The low gene number in humans
We have sequenced and assembled ~95% of the euchromatic sequence of H.
sapiens and used a new automated gene prediction method to produce a
preliminary catalog of the human genes. This has provided a major surprise:
We have found far fewer genes (26,000 to 38,000) than the earlier molecular
predictions (50,000 to over 140,000). Whatever the reasons for this current
disparity, only detailed annotation, comparative genomics (particularly using
the Mus musculus genome), and careful molecular dissection of complex
phenotypes will clarify this critical issue of the basic "parts list" of our
genome. Certainly, the analysis is still incomplete and considerable
refinement will occur in the years to come as the precise structure of each
transcription unit is evaluated. A good place to start is to determine why
the gene estimates derived from EST data are so discordant with our
predictions. It is likely that the following contribute to an inflated gene
number derived from ESTs: the variable lengths of 3'- and 5'-untranslated
leaders and trailers; the little-understood vagaries of RNA processing that
often leave intronic regions in an unspliced condition; the finding that
nearly 40% of human genes are alternatively spliced (153); and finally, the
unsolved technical problems in EST library construction where contamination
from heterogeneous nuclear RNA and genomic DNA are not uncommon. Of course,
it is possible that there are genes that remain unpredicted owing to the
absence of EST or protein data to support them, although our use of mouse
genome data for predicting genes should limit this number. As was true at the
beginning of genome sequencing, ultimately it will be necessary to measure
mRNA in specific cell types to demonstrate the presence of a gene.

J. B. S. Haldane speculated in 1937 that a population of organisms might have
to pay a price for the number of genes it can possibly carry. He theorized
that when the number of genes becomes too large, each zygote carries so many
new deleterious mutations that the population simply cannot maintain itself.
On the basis of this premise, and on the basis of available mutation rates
and x-ray-induced mutations at specific loci, Muller, in 1967 (154),
calculated that the mammalian genome would contain a maximum of not much more
than 30,000 genes (155). An estimate of 30,000 gene loci for humans was also
arrived at by Crow and Kimura (156). Muller's estimate for D. melanogaster
was 10,000 genes, compared to 13,000 derived by annotation of the fly genome
(26, 27). These arguments for the theoretical maximum gene number were based
on simplified ideas of genetic load: that all genes have a certain low rate
of mutation to a deleterious state. However, it is clear that many mouse,
fly, worm, and yeast knockout mutations lead to almost no discernible
phenotypic perturbations.

The modest number of human genes means that we must look elsewhere for the
mechanisms that generate the complexities inherent in human development and
the sophisticated signaling systems that maintain homeostasis. There are a
large number of ways in which the functions of individual genes and gene
products are regulated. The degree of "openness" of chromatin structure and
hence transcriptional activity is regulated by protein complexes that involve
histone and DNA enzymatic modifications. We enumerate many of the proteins
that are likely involved in nuclear regulation in Table 19. The location,
timing, and quantity of transcription are intimately linked to nuclear signal
transduction events as well as by the tissue-specific expression of many of
these proteins. Equally important are regulatory DNA elements that include
insulators, repeats, and endogenous viruses (157); methylation of CpG islands
in imprinting (158); and promoter-enhancer and intronic regions that modulate
transcription. The spliceosomal machinery consists of multisubunit proteins
(Table 19) as well as structural and catalytic RNA elements (159) that
regulate transcript structure through alternative start and termination sites
and splicing. Hence, there is a need to study different classes of RNA
molecules (160) such as small nucleolar RNAs, antisense riboregulator RNA,
RNA involved in X-dosage compensation, and other structural RNAs to
appreciate their precise role in regulating gene expression. The phenomenon
of RNA editing in which coding changes occur directly at the level of mRNA is
of clinical and biological relevance (161). Finally, examples of
translational control include internal ribosomal entry sites that are found
in proteins involved in cell cycle regulation and apoptosis (162). At the
protein level, minor alterations in the nature of protein-protein
interactions, protein modifications, and localization can have dramatic
effects on cellular physiology (163). This dynamic system therefore has many
ways to modulate activity, which suggests that definition of complex systems
by analysis of single genes is unlikely to be entirely successful.

In situ studies have shown that the human genome is asymmetrically populated
with G+C content, CpG islands, and genes (68). However, the genes are not
distributed quite as unequally as had been predicted (Table 9) (69). The most
G+C-rich fraction of the genome, H3 isochores, constitute more of the genome
than previously thought (about 9%), and are the most gene-dense fraction, but
contain only 25% of the genes, rather than the predicted ~40%. The low G+C L
isochores make up 65% of the genome, and 48% of the genes. This
inhomogeneity, the net result of millions of years of mammalian gene
duplication, has been described as the "desertification" of the vertebrate
genome (71). Why are there clustered regions of high and low gene density,
and are these accidents of history or driven by selection and evolution? If
these deserts are dispensable, it ought to be possible to find mammalian
genomes that are far smaller in size than the human genome. Indeed, many
species of bats have genome sizes that are much smaller than that of humans;
for example, Miniopterus, a species of Italian bat, has a genome size that is
only 50% that of humans (164). Similarly, Muntiacus, a species of Asian
barking deer, has a genome size that is ~70% that of humans.
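
The isochore percentages quoted above can be turned into a relative gene density, that is, the share of genes divided by the share of genome. This is only an illustrative sketch of that arithmetic: the names are assumptions of this example, and the inputs are the rounded figures from the text ("about 9%", 25%, 65%, 48%), so the results are approximate.

```python
# Shares quoted in the text, as fractions: (share of genome, share of genes).
isochores = {
    "H3 (G+C-rich)": (0.09, 0.25),
    "L (low G+C)": (0.65, 0.48),
}


def relative_gene_density(genome_share, gene_share):
    """Gene density relative to the genome-wide average (1.0 = average)."""
    return gene_share / genome_share


density = {
    name: relative_gene_density(genome_share, gene_share)
    for name, (genome_share, gene_share) in isochores.items()
}

for name, value in density.items():
    print(f"{name}: {value:.2f}x the genome-wide average gene density")
```

On these rounded inputs the H3 fraction is well above the average density and the L fraction somewhat below it, consistent with the description of H3 as the most gene-dense fraction.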

8.3 Human DNA sequence variation
and its distribution across the genome
This is the first eukaryotic genome in which a nearly uniform ascertainment
of polymorphism has been completed. Although we have identified and mapped
more than 3 million SNPs, this by no means implies that the task of finding
and cataloging SNPs is complete. These represent only a fraction of the SNPs
present in the human population as a whole. Nevertheless, this first glimpse
at genome-wide variation has revealed strong inhomogeneities in the
distribution of SNPs across the genome. Polymorphism in DNA carries with it a
snapshot of the past operation of population genetic forces, including
mutation, migration, selection, and genetic drift. The availability of a
dense array of SNPs will allow questions related to each of these factors to
be addressed on a genome-wide basis. SNP studies can establish the range of
haplotypes present in subjects of different ethnogeographic origins,
providing insights into population history and migration patterns. Although
such studies have suggested that modern human lineages derive from Africa,
many important questions regarding human origins remain unanswered, and more
analyses using detailed SNP maps will be needed to settle these
controversies. In addition to providing evidence for population expansions,
migration, and admixture, SNPs can serve as markers for the extent of
evolutionary constraint acting on particular genes. The correlation between
patterns of intraspecies and interspecies genetic variation may prove to be
especially informative to identify sites of reduced genetic diversity that
may mark loci where sequence variations are not tolerated.

The remarkable heterogeneity in SNP density implies that there are a variety
of forces acting on polymorphism: sparse regions may have lower SNP density
because the mutation rate is lower, because most of those regions have a
lower fraction of mutations that are tolerated, or because recent strong
selection in favor of a newly arisen allele "swept" the linked variation out
of the population (165). The effect of random genetic drift also varies
widely across the genome. The nonrecombining portion of the Y chromosome
faces the strongest pressure from random drift because there are roughly
one-quarter as many Y chromosomes in the population as there are autosomal
chromosomes, and the level of polymorphism on the Y is correspondingly less.
Similarly, the X chromosome has a smaller effective population size than the
autosomes, and its nucleotide diversity is also reduced. But even across a
single autosome, the effective population size can vary because the density
of deleterious mutations may vary. Regions of high density of deleterious
mutations will see a greater rate of elimination by selection, and the
effective population size will be smaller (166). As a result, the density of
even completely neutral SNPs will be lower in such regions. There is a large
literature on the association between SNP density and local recombination
rates in Drosophila, and it remains an important task to assess the strength
of this association in the human genome, because of its impact on the design
of local SNP densities for disease-association studies. It also remains an
important task to validate SNPs on a genomic scale in order to assess the
degree of heterogeneity among geographic and ethnic populations.

8.4 Genome complexity
We will soon be in a position to move away from the cataloging of individual
components of the system, and beyond the simplistic notions of "this binds to
that, which then docks on this, and then the complex moves there..." (167) to
the exciting area of network perturbations, nonlinear responses and
thresholds, and their pivotal role in human diseases.

The enumeration of other "parts lists" reveals that in organisms with complex
nervous systems, neither gene number, neuron number, nor number of cell types
correlates in any meaningful manner with even simplistic measures of
structural or behavioral complexity. Nor would they be expected to; this is
the realm of nonlinearities and epigenesis (168). The 520 million neurons of
the common octopus exceed the neuronal number in the brain of a mouse by an
order of magnitude. It is apparent from a comparison of genomic data on the
mouse and human, and from comparative mammalian neuroanatomy (169), that the
morphological and behavioral diversity found in mammals is underpinned by a
similar gene repertoire and similar neuroanatomies. For example, when one
compares a pygmy marmoset (which is only 4 inches tall and weighs about 6
ounces) to a chimpanzee, the brain volume of this minute primate is found to
be only about 1.5 cm³, two orders of magnitude less than that of a chimp and
three orders less than that of humans. Yet the neuroanatomies of all three
brains are strikingly similar, and the behavioral characteristics of the
pygmy marmoset are little different from those of chimpanzees. Between humans
and chimpanzees, the gene number, gene structures and functions, chromosomal
and genomic organizations, and cell types and neuroanatomies are almost
indistinguishable, yet the developmental modifications that predisposed human
lineages to cortical expansion and development of the larynx, giving rise to
language, culminated in a massive singularity that by even the simplest of
criteria made humans more complex in a behavioral sense.

Simple examination of the number of neurons, cell types, or genes or of the
genome size does not alone account for the differences in complexity that we
observe. Rather, it is the interactions within and among these sets that
result in such great variation. In addition, it is possible that there are
"special cases" of regulatory gene networks that have a disproportionate
effect on the overall system. We have presented several examples of
"regulatory genes" that are significantly increased in the human genome
compared with the fly and worm. These include extracellular ligands and their
cognate receptors (e.g., wnt, frizzled, TGF-β, ephrins, and connexins), as
well as nuclear regulators (e.g., the KRAB and homeodomain transcription
factor families), where a few proteins control broad developmental processes.
The answers to these "complexities" perhaps lie in these expanded gene
families and differences in the regulatory control of ancient genes,
proteins, pathways, and cells.

8.5 Beyond single components
While few would disagree with the intuitive conclusion that Einstein's brain
was more complex than that of Drosophila, closer comparisons such as whether
the set of predicted human proteins is more complex than the protein set of
Drosophila, and if so, to what degree, are not straightforward, since
protein, protein domain, or protein-protein interaction measures do not
capture context-dependent interactions that underpin the dynamics underlying
phenotype.

Currently, there are more than 30 different mathematical descriptions of complexity
(170). However, we have yet to understand the mathematical dependency relating the
number of genes with organism complexity. One pragmatic approach to the analysis of
biological systems, which are composed of nonidentical elements (proteins, protein
complexes, interacting cell types, and interacting neuronal populations), is
through graph theory (171). The elements of the system can be represented by the
vertices of complex topographies, with the edges representing the interactions
between them. Examination of large networks reveals that they can self-organize,
but more important, they can be particularly robust. This robustness is not due to
redundancy, but is a property of inhomogeneously wired networks. The error
tolerance of such networks comes with a price; they are vulnerable to the selection
or removal of a few nodes that contribute disproportionately to network stability.
Gene knockouts provide an illustration. Some knockouts may have minor effects,
whereas others have catastrophic effects on the system. In the case of vimentin, a
supposedly critical component of the cytoplasmic intermediate filament network of
mammals, the knockout of the gene in mice reveals them to be reproductively normal,
with no obvious phenotypic effects (172), and yet the usually conspicuous vimentin
network is completely absent. On the other hand, ~30% of knockouts in Drosophila
and mice correspond to critical nodes whose reduction in gene product, or total
elimination, causes the network to crash most of the time, although even in some of
these cases, phenotypic normalcy ensues, given the appropriate genetic background.
Thus, there are no "good" genes or "bad" genes, but only networks that exist at
various levels and at different connectivities, and at different states of
sensitivity to perturbation. Sophisticated mathematical analysis needs to be
constantly evaluated against hard biological data sets that specifically address
network dynamics. Nowhere is this more critical than in attempts to come to grips
with "complexity," particularly because deconvoluting and correcting complex
networks that have undergone perturbation, and have resulted in human diseases, is
the greatest significant challenge now facing us.
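
The point about inhomogeneously wired networks can be illustrated with a toy graph.
The following is only a minimal sketch under stated assumptions: the two-hub
network, the node names, and the helper functions are hypothetical and are not
taken from the cited analyses. It compares the largest connected component after
removing a peripheral node with that after removing a hub.

```python
from collections import deque

def largest_component(adj, removed=frozenset()):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen, best = set(removed), 0          # removed nodes are never visited
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                       # breadth-first search from `start`
            node = queue.popleft()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        best = max(best, size)
    return best

def build_toy_network():
    """Hypothetical inhomogeneous network: two linked hubs, five leaves each."""
    edges = [("hub1", "hub2")]
    edges += [("hub1", f"a{i}") for i in range(5)]
    edges += [("hub2", f"b{i}") for i in range(5)]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

network = build_toy_network()
intact = largest_component(network)                  # every node is reachable
leaf_knockout = largest_component(network, {"a0"})   # minor effect
hub_knockout = largest_component(network, {"hub1"})  # network fragments
```

In this toy case, deleting a leaf leaves the rest of the network connected, whereas
deleting one hub isolates all of its leaves, which is the qualitative behavior the
paragraph describes.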

It has been predicted for the last 15 years that complete sequencing of the human
genome would open up new strategies for human biological research and would have a
major impact on medicine, and through medicine and public health, on society.
Effects on biomedical research are already being felt. This assembly of the human
genome sequence is but a first, hesitant step on a long and exciting journey toward
understanding the role of the genome in human biology. It has been possible only
because of innovations in instrumentation and software that have allowed automation
of almost every step of the process from DNA preparation to annotation. The next
steps are clear: We must define the complexity that ensues when this relatively
modest set of about 30,000 genes is expressed. The sequence provides the framework
upon which all the genetics, biochemistry, physiology, and ultimately phenotype
depend. It provides the boundaries for scientific inquiry. The sequence is only the
first level of understanding of the genome. All genes and their control elements
must be identified; their functions, in concert as well as in isolation, defined;
their sequence variation worldwide described; and the relation between genome
variation and specific phenotypic characteristics determined. Now we know what we
have to explain.

Another paramount challenge awaits: public discussion of this information and its
potential for improvement of personal health. Many diverse sources of data have
shown that any two individuals are more than 99.9% identical in sequence, which
means that all the glorious differences among individuals in our species that can
be attributed to genes fall in a mere 0.1% of the sequence. There are two fallacies
to be avoided: determinism, the idea that all characteristics of the person are
"hard-wired" by the genome; and reductionism, the view that with complete knowledge
of the human genome sequence, it is only a matter of time before our understanding
of gene functions and interactions will provide a complete causal description of
human variability. The real challenge of human biology, beyond the task of finding
out how genes orchestrate the construction and maintenance of the miraculous
mechanism of our bodies, will lie ahead as we seek to explain how our minds have
come to organize thoughts sufficiently well to investigate our own existence.

[...] peripheral blood lymphocytes from all samples selected for sequencing; all
were normal. A two-staged consent process for prospective donors was employed. The
first stage of the consent process provided information about the genome project,
procedures, and risks and benefits of participating. The second stage of the
consent process involved answering follow-up questions and signing consent forms,
and was conducted about 48 hours after the first.

33. DNA was isolated from blood (173) or sperm. For sperm, a washed pellet (100 µl)
was lysed in a suspension (1 ml) containing 0.1 M NaCl, 10 mM tris-Cl, 20 mM EDTA
(pH 8), 1% SDS, 1 mg proteinase K, and 10 mM dithiothreitol for 1 hour at 37°C. The
lysate was extracted with aqueous phenol and with phenol/chloroform. The DNA was
ethanol precipitated and dissolved in 1 ml TE buffer. To make genomic libraries,
DNA was randomly sheared, end-polished with consecutive BAL31 nuclease and T4 DNA
polymerase treatments, and size-selected by electrophoresis on 1% low-melting-point
agarose. After ligation to Bst XI adapters (Invitrogen, catalog no. N408.18), DNA
was purified by three rounds of gel electrophoresis to remove excess adapters, and
the fragments, now with 3'-CACA overhangs, were inserted into Bst XI-linearized
plasmid vector with 3'-TGTG overhangs. Libraries with three different [illegible]
kbp. The 2-kbp fragments were cloned in a high-copy pUC18 derivative. The 10- and
50-kbp fragments were cloned in a medium-copy pBR322 derivative. The 2- and 10-kbp
libraries yielded [illegible] colonies on plating. However, [illegible] colonies
and inserts were unstable. To remedy this, the 50-kbp [illegible] not cleave the
vector, but generally cleaved several times within the 50-kbp insert. A 1264-bp
Bam HI kanamycin resistance cassette (purified from pUC4K; Amersham Pharmacia,
catalog no. 27-4958-01) was added and ligation was carried out at 37°C in the
continual presence of Bgl II. As Bgl II-Bgl II ligations occurred, they were
continually cleaved, whereas Bam HI-Bgl II ligations were not cleaved. A high yield
of internally deleted circular library molecules was obtained in which the residual
insert ends were separated by the kanamycin cassette DNA. The internally deleted
libraries, when plated [illegible]cillin (50 µg/ml), and kanamycin (15 µg/ml),
produced relatively uniform, large colonies. The resulting clones could be prepared
for sequencing using [illegible] the 10-kbp libraries.

34. Transformed cells were plated on agar diffusion plates prepared with a fresh
top layer containing no antibiotic poured on top of a previously set bottom layer
containing excess antibiotic, to achieve the correct final concentration. This
method of plating permitted the cells to develop antibiotic resistance before being
exposed to antibiotic without the potential clone bias that can be introduced
through liquid outgrowth protocols. After colonies had grown, QBot (Genetix, UK)
automated colony-picking robots were used to pick colonies meeting stringent size
and shape criteria and to inoculate 384-well [illegible] growth medium.
[illegible] were incubated overnight, with shaking, and were scored for growth
before [illegible] template preparation. Template DNA was extracted from liquid
bacterial culture using a procedure based upon the alkaline lysis miniprep method
(175) adapted for high throughput processing in [illegible] centrifugation; and
plasmid DNA was recovered by isopropanol precipitation and resuspended in 10 mM
tris-HCl buffer. Reagent dispensing operations were accomplished using Titertek
MAP 8 liquid dispensing systems. Plate-to-plate liquid transfers were performed
using Tomtec Quadra 384 Model 320 pipetting robots. All plates were tracked
throughout processing by unique plate barcodes. Mated sequencing reads from
opposite ends of each clone insert were obtained by preparing two 384-well cycle
sequencing reaction plates from each plate of plasmid template DNA using ABI PRISM
BigDye Terminator chemistry (Applied Biosystems) and standard M13 forward and
reverse primers. Sequencing reactions were prepared using the Tomtec Quadra 384-320
pipetting robot. Parent-child plate relationships and, by extension,
forward-reverse sequence mate pairs were established by automated plate barcode
reading by the onboard barcode reader and were recorded by direct LIMS
communication. Sequencing reaction [illegible] by precipitation and [illegible]
stored at 4°C in the dark until needed for sequencing, at which time the reaction
products were resuspended in deionized formamide and sealed immediately to prevent
degradation. All sequence data were generated using a single sequencing platform,
the ABI PRISM 3700 DNA Analyzer. Sample sheets were created at load time using a
Java-based application that facilitates barcode scanning of the sequencing plate
barcode, retrieves sample information from the central LIMS, and reserves unique
trace identifiers. The application [illegible] directory and deleted previously
created sample sheet files immediately upon scanning of a


[illegible] Science 238, 336 (1987).
36. Celera's computing environment is [illegible] central processing unit (CPU)
[illegible]. The compute farm is composed of [illegible] running at 667 MHz
[illegible] (RAID-0), disk striping with parity (RAID-5) [illegible]

[illegible] vector and adapter sequence from high-quality reads [illegible] mean
accuracy of [illegible]

[illegible] by Celera mate pairs. An interval of a BAC was deemed an assembly error
where there were no [illegible]

[illegible] York, 1996), pp. 73-89.

[illegible] fragments on the scaffolds was analyzed. If the spread of these
fragments was [illegible] times the reported BAC length, [illegible]

48. M. Hattori et al., Nature 405, 311 (2000).
[illegible] (1999).
[illegible] 97, 13239 (2000).
[illegible] Consortium, available [illegible]
[illegible], Trends Biotechnol. 16, 456 (1998).
54. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, J. Mol. Biol.
215, 403 (1990).
55a. M. Olivier et al., Science 291, 1298 (2001).
55b. See http://genome.ucsc.edu/.
56. N. Chaudhari, W. E. Hahn, Science 220, 924 (1983); [illegible] 5497
[illegible].
57. D. Dickson, Nature 401, 311 (1999).
[illegible] Genet. 25, 232 (2000).
[illegible]
60. M. Yandell, in preparation.
[illegible] Trends Genet. 16, 44 (2000).
62. Scaffolds containing greater than 10 kbp of sequence were analyzed for features
of biological importance through a series of computational steps [illegible]. For
scaffolds greater than one megabase, the sequence was cut into single megabase
pieces before [illegible] was masked for complex repeats using Repeatmasker (52)
before gene finding or homology-based analysis. The computational pipeline required
~7 hours of CPU time per megabase, including repeat masking, or a total compute
time of about 20,000 CPU hours. Protein [illegible] nonredundant [illegible] NCBI.
Nucleotide [illegible] human, mouse, and [illegible] (assemblies of cDNA and EST
sequences), mouse genomic DNA reads generated at Celera (3×), the Ensembl gene
database [illegible] Bioinformatics Institute (EBI), human and rodent (mouse and
rat) EST data sets parsed from the dbEST database (NCBI), [illegible] RefSeq
experimental [illegible] database (NCBI). Initial searches were performed on
repeat-masked sequence with BLAST 2.0 (54) optimized for the Compaq Alpha
compute-server and an effective database size of 3 × 10^[exponent illegible] for
[illegible] 10^[exponent illegible] for BLASTX searches. Additional processing of
each query-subject pair [illegible] alignments. All protein BLAST results having an
expectation score of <1 × 10^[exponent illegible], human nucleotide BLAST results
having an expectation score of <1 × 10^[exponent illegible] with >94% identity, and
rodent nucleotide BLAST results having an expectation score of
<1 × 10^[exponent illegible] with >80% identity were then examined on the basis of
their high-scoring pair (HSP) coordinates on the scaffold to remove redundant hits,
retaining hits that supported possible alternative splicing. For BLASTX searches,
analysis was performed separately for selected model organisms (yeast, mouse,
human, C. elegans, and D. melanogaster) so as not to exclude HSPs from these
organisms that support the same [illegible] judged to be informative, nonredundant,
and sufficiently [illegible] realigned to the genomic sequence with Sim4 for ESTs,
and with Lap for proteins. Because both of these algorithms take splicing into
account, the resulting alignments usually give a better representation of
intron-exon boundaries than standard BLAST analyses and thus facilitate further
annotation (both machine and human). In addition to the homology-based analysis
described above, three ab initio gene prediction programs were used (63).
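
The scaffold-splitting and hit-filtering steps of note 62 can be sketched roughly
as follows. This is a minimal illustration, not the pipeline actually used: the
function names are hypothetical, and because the exponent of each expectation-score
cutoff is not legible in this copy, the E-value threshold is left as a
caller-supplied parameter. Only the identity cutoffs (>94% for human, >80% for
rodent) and the one-megabase piece size are taken from the note.

```python
MEGABASE = 1_000_000
# Identity cutoffs stated in the note for nucleotide BLAST results.
MIN_IDENTITY_PCT = {"human": 94.0, "rodent": 80.0}

def split_scaffold(length_bp, piece_bp=MEGABASE):
    """Cut a scaffold longer than one megabase into single-megabase pieces.

    Returns half-open (start, end) intervals; shorter scaffolds stay whole.
    """
    if length_bp <= piece_bp:
        return [(0, length_bp)]
    return [(start, min(start + piece_bp, length_bp))
            for start in range(0, length_bp, piece_bp)]

def keep_nucleotide_hit(evalue, identity_pct, source, max_evalue):
    """Keep a nucleotide BLAST hit that passes both cutoffs.

    `max_evalue` is a hypothetical, caller-supplied threshold because the
    exponent of the published cutoff cannot be read from this copy.
    """
    return evalue < max_evalue and identity_pct > MIN_IDENTITY_PCT[source]

pieces = split_scaffold(2_500_000)
```

A scaffold of 2.5 Mbp yields two full megabase pieces and one 0.5-Mbp remainder; a
hit at 85% identity would be kept for a rodent source but rejected for a human one.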

63. E. C. Uberbacher, Y. Xu, R. J. Mural, Methods Enzymol. 266, 259 (1996); C.
Burge, S. Karlin, J. Mol. Biol. 268, 78 [illegible] 303, 77 [illegible];
[illegible] Genome Res. 10, 516 (2000); [illegible] 8, 967 [illegible].

[illegible] (BIOS Scientific, Oxford, 1996), pp. 121-[illegible].
66. J. E. Horvath, S. Schwartz, [illegible] 10, 839 (2000).
[illegible], Trends Genet. 5, 144 [illegible].
68. G. P. Holmquist, Am. J. Hum. Genet. 51, 17 (1992).
69. G. Bernardi, Gene 241, 3 (2000).
[illegible], Gene 174, 95 (1996).
[illegible] (1985).
74. A. Bird, Trends Genet. 3, 342 (1987).
[illegible]
[illegible] 5, 309 (1995).
[illegible].com/2000/1/5/[illegible]
[illegible]
[illegible]
81. S. H. Cross et al., Mamm. Genome 11, 373 (2000).
82. D. Slavov et al., Gene 247, 215 (2000).
[illegible] 23, 98 (1995).
[illegible] (2000).
N. Chkheidze, S. A. Liebhaber, J. Biol. Chem. 274, 24849 (1999).
86. V. Pan, W. K. Decker, A. H. H. M. Huq, W. J. Craigen, Genomics 59, 282 (1999).
87. P. Nouvel, Genetica 93, 191 (1994).
[illegible]

[illegible] compares all proteins in the proteome to [illegible] a hit between two
proteins [illegible] beneath a user-specified [illegible] graph [illegible] then
uses this [illegible] protein pair [illegible] as a whole by [illegible] common
between the two proteins by the total [illegible] properties. First, because the
[illegible] at the level of BLAST hits, the metric respects the [illegible] of
protein space. Two multidomain [illegible] containing do[illegible] to each other
than either one will have to a protein containing only A or B domains, so long as
A-B-containing multidomain proteins are less frequent in the [illegible]
single-domain proteins con[illegible] interesting property [illegible] the metric
is that it can be used to [illegible] the proteome as a whole without having to
first produce a multiple [illegible] metric does not require that either sequence
have [illegible] defined similarity to each other, only that they share at least
one significant BLAST hit in common. This is an especially interesting property of
the metric because it allows the rapid recovery of protein families from the
proteome for which no multiple alignment is possible, thus providing a
computational basis for the extension of protein homology searches beyond those of
current HMM- and profile-based search methods. Once the whole-proteome similarity
matrix has been calculated, Lek first partitions the proteome into single-linkage
clusters (27) on the basis of one or more shared BLAST hits between two sequences.
Next, these single-linkage clusters are further partitioned into subclusters, each
member of which shares a user-specified pairwise similarity with the other members
of the cluster, as described above. For the purposes of this publication, we have
focused on the analysis of single-linkage clusters and what we have termed
"complete clusters," e.g., those subclusters for which every member has a
similarity metric of 1 to every other member of the subcluster. We believe that the
single-linkage and complete clusters are of special interest, in part, because they
allow us to estimate and to compare sizes of core protein sets in a rigorous
manner. The rationale for this is as follows: If one imagines for a moment a
perfect clustering algorithm capable of perfectly partitioning one or more
perfectly annotated protein sets into protein families, it is reasonable to assume
that the number of clusters will always be greater than, or equal to, the number of
single-linkage clusters, because single-linkage clustering is a maximally
agglomerative clustering method. Thus, if there exists a single protein in the
predicted protein set containing domains A and B, then it will be clustered by
single linkage together with all single-domain proteins containing domains A or B.
Likewise, for a predicted protein set containing a single multidomain protein, the
number of real clusters must always be less than or equal to the number of complete
clusters, because it is impossible to place a unique multidomain protein into a
complete cluster. Thus, the single-linkage and complete clusters plus singletons
should comprise a lower and upper bound of sizes of core protein sets,
respectively, allowing us to compare the relative size and complexity of different
organisms' predicted protein set.
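
The clustering idea in this note can be sketched in a few lines. The sketch rests
on an assumption: the similarity between two proteins is read here as the number of
BLAST hits they share divided by the total number of distinct hits of either
protein. That reading, the function names, and the five-protein example (two
proteins with domain A only, one with domain B only, one with both, and one
unrelated) are hypothetical illustrations and are not the Lek implementation.

```python
from itertools import combinations

def lek_similarity(hits_a, hits_b):
    """Shared hits divided by all distinct hits (assumed reading of the metric)."""
    union = hits_a | hits_b
    return len(hits_a & hits_b) / len(union) if union else 0.0

def single_linkage(hits):
    """Group proteins linked, directly or through a chain, by a shared hit."""
    parent = {p: p for p in hits}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    for a, b in combinations(hits, 2):
        if hits[a] & hits[b]:               # at least one BLAST hit in common
            parent[find(a)] = find(b)
    clusters = {}
    for p in hits:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(clusters.values(), key=sorted)

def is_complete(cluster, hits):
    """True when every pair in the cluster has a similarity of exactly 1."""
    return all(lek_similarity(hits[a], hits[b]) == 1.0
               for a, b in combinations(sorted(cluster), 2))

# Hypothetical hit sets: each protein maps to the set of proteins it hits.
hits = {
    "A1": {"A1", "A2", "AB"},
    "A2": {"A1", "A2", "AB"},
    "B1": {"B1", "AB"},
    "AB": {"A1", "A2", "B1", "AB"},
    "C1": {"C1"},
}
clusters = single_linkage(hits)
```

In this example single linkage merges the A-only, B-only, and A-B proteins into one
cluster, while only the two A-only proteins form a complete cluster, which mirrors
the lower-bound and upper-bound argument made in the note.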

90. T. F. Smith, M. S. Waterman, J. Mol. Biol. 147, 195 (1981).

91. A. L. Delcher et al., Nucleic Acids Res. 27, 2369 (1999).

92. Arabidopsis Genome Initiative, Nature 408, 796 (2000).

93. The probability that a contiguous set of proteins is the result of a segmental
duplication can be estimated approximately as follows. Given that protein A and B
occur on one chromosome, and that A' and B' (paralogs of A and B) also exist in the
genome, the probability that B' occurs immediately after A' is $1/N$, where $N$ is
the number of proteins in the set (for this analysis, $N = 26{,}588$). Allowing for
B' to occur as any of the next $J-1$ proteins (leaving a gap between A' and B')
increases the probability to $(J-1)/N$; allowing B'A' or A'B' gives a probability
of $2(J-1)/N$. Considering three genes ABC, the probability of observing A'B'C'
elsewhere in the genome, given that the paralogs exist, is $1/N^{2}$. Three
proteins can occur across a spread of five positions in six ways; more generally,
we compute the number of ways that $K$ proteins can be spread across $J$ positions
by counting all possible arrangements of $K-2$ proteins in the $J-2$ positions
between the first and last protein. Allowing for a spread to vary from $K$
positions (no gaps) to $J$ gives
$$L = \sum_{X=K-2}^{J-2} \binom{X}{K-2}$$
arrangements. Thus, the probability of chance occurrence is $L/N^{K-1}$. Allowing
for both sets of genes (e.g., ABC and A'B'C') to be spread across $J$ positions
increases this to $L^{2}/N^{K-1}$. The duplicated segment might be rearranged by
the operations of reversal or translocation; allowing for $M$ such rearrangements
gives us a probability $P = L^{2}M/N^{K-1}$. For example, the probability of
observing a duplicated set of three genes in two different locations, where the
three genes occur across a spread of five positions in both locations, is
$36/N^{2}$; the expected number of such matched sets in the predicted protein set
is approximately $(N)\,36/N^{2} = 36/N$, a value $\ll 1$. Therefore, any such
duplications of three genes are unlikely to result from random rearrangements of
the genome. If any of the genes occur in more than two copies, the probability that
the apparent duplication has occurred by chance increases. The algorithm for
selecting candidate duplications only generates matched protein sets with
$P \ll 1$.
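
The arithmetic of note 93 can be checked with a short calculation. This is a sketch
under stated assumptions: the arrangement count is taken to be the sum of binomial
coefficients that reproduces the note's own worked example (three proteins across
five positions in six ways), the function names are hypothetical, and the value of
N is an assumed reading of the figure printed in the note.

```python
from math import comb

def arrangements(k, j):
    """Ways to spread k proteins over k..j positions with both ends occupied.

    For a spread of s positions, the k - 2 interior proteins are placed in the
    s - 2 interior positions; summing over s = k..j gives the total.
    """
    return sum(comb(x, k - 2) for x in range(k - 2, j - 1))

def chance_probability(k, j, n, m=1):
    """P = L^2 * M / N^(K-1), with both gene sets allowed a spread of up to j."""
    count = arrangements(k, j)
    return count * count * m / n ** (k - 1)

N = 26_588                        # assumed reading of N from the note
L3 = arrangements(3, 5)           # three genes across five positions
P3 = chance_probability(3, 5, N)  # chance of one such duplicated set
expected_sets = N * P3            # expected number in the protein set
```

With these inputs the expected number of chance matches is 36/N, on the order of
one in a thousand, consistent with the note's conclusion that such duplications are
unlikely to arise from random rearrangement.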
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1 16. Brief' description of the methods used to build the' ■ 
Panther classification. First, the June 2000 release of the GenBank NR protein
database (excluding sequences annotated as fragments or mutants) was partitioned
into clusters using BLASTP. For the clustering, a seed sequence was randomly
chosen, and the cluster was defined as all sequences matching the seed to
statistical significance (E-value < 10^[exponent illegible]) and "globally"
alignable (the length of the match region must be >70% and <130% of the length of
the seed). If the cluster had more than five members, and at least one from a
multicellular eukaryote, the cluster was extended. For the extension step, a hidden
Markov Model (HMM) was trained for the cluster, using the SAM software package,
version [illegible] was then scored against GenBank NR (excluding mutants but
including fragments for this step), and all sequences scoring better than a
specific (NLL-NULL) score were added to the cluster. The HMM was then retrained
(with fixed model length) and all sequences in the cluster were aligned to the HMM
to produce a multiple sequence alignment. This alignment was assessed by a number
of quality measures. If the alignment failed the quality check, the initial cluster
was rebuilt around the seed using a more restrictive E-value, followed by
extension, alignment, and reassessment. This process was repeated until the
alignment quality was good. The multiple alignment and "general" (i.e., describing
the entire cluster, or "family") HMM (176) were then used as input into the BETE
program (177). BETE calculates a phylogenetic tree for the sequences in the
alignment. Functional information about the sequences in each cluster was parsed
from SwissProt (178) and GenBank records. "Tree-attribute viewer" software was used
by biologist curators to correlate the phylogenetic tree with protein function.
Subfamilies were manually defined on the basis of shared function across subtrees,
and were named accordingly. HMMs were then built for each subfamily, using
information from both the subfamily and family (K. Sjölander, in preparation).
Families were also manually named according to the functions contained within them.
Finally, all of the families and subfamilies were classified into categories and
subcategories based on their molecular functions. The categorization was done by
manual review of the family and subfamily names, by examining SwissProt and GenBank
records, and by review of the literature as well as resources on the World Wide
Web. The current version (2.0) of the Panther molecular function schema has four
levels: category, subcategory, family, and subfamily. Protein sequences for whole
eukaryotic genomes (for the predicted human proteins and annotated proteins for
fly, worm, yeast, and Arabidopsis) were scored against the Panther library of
family and subfamily HMMs. If the score was significant (the NLL-NULL score cutoff
depends on the protein family), the protein was assigned to the family or subfamily
function with the most significant score.
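
The seed-cluster membership rules described in this note can be written as two
small predicates. This is a hedged sketch rather than the Panther code: the
function names and the record layout are hypothetical, and the E-value cutoff is a
caller-supplied parameter because its exponent is not legible in this copy. The
70% to 130% length window and the "more than five members, at least one from a
multicellular eukaryote" rule are taken from the note.

```python
def in_seed_cluster(evalue, match_len, seed_len, max_evalue):
    """True when a sequence matches the seed significantly and "globally".

    `max_evalue` is hypothetical and must be supplied by the caller.
    The match region must be >70% and <130% of the seed length.
    """
    ratio = match_len / seed_len
    return evalue < max_evalue and 0.70 < ratio < 1.30

def should_extend(members):
    """True when a cluster qualifies for the HMM extension step.

    `members` is a list of dicts with a boolean "multicellular_eukaryote"
    key (a hypothetical record layout used only for this illustration).
    """
    return len(members) > 5 and any(m["multicellular_eukaryote"] for m in members)

# Example: a 400-residue seed and a 350-residue match region (ratio 0.875).
accepted = in_seed_cluster(1e-12, 350, 400, max_evalue=1e-3)
too_short = in_seed_cluster(1e-12, 200, 400, max_evalue=1e-3)
```

Keeping the E-value threshold as a parameter also fits the note's rebuild step, in
which a failed alignment is retried around the same seed with a more restrictive
E-value.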
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EXHIBIT L 




A historic
moment for
the scientific
endeavor.



THE HUMAN 
GENOME 

Humanity has been given a great gift. With the completion of the human
genome sequence, we have received a powerful tool for unlocking the
secrets of our genetic heritage and for finding our place among the other
participants in the adventure of life.

This week's issue of Science contains the report of the sequencing of
the human genome from a group of authors led by Craig Venter of Celera
Genomics. The report of the sequencing of the human genome from the
publicly funded consortium of laboratories led by Francis Collins appears
in this week's Nature. This stunning achievement has been portrayed,
often unfairly, as a competition between two
ventures, one public and one private. That characterization detracts from
the awesome accomplishment jointly unveiled this week. In truth, each
project contributed to the other. The inspired vision that launched the
publicly funded project roughly 10 years ago reflected, and now rewards,
the confidence of those who believe that the pursuit of large-scale
fundamental problems in the life sciences is in the national interest. The technical
innovation and drive of Craig Venter and his colleagues made it possible
to celebrate this accomplishment far sooner than was believed possible.
Thus, we can salute what has become, in the end, not a contest but a
marriage (perhaps encouraged by shotgun) between public funding and
private entrepreneurship.

There are excellent scientific reasons for applauding an outcome that has given us two
winners. Two sequences are better than one; the opportunity for comparison and
convergence is invaluable. Indeed, a real-world proof of the importance of access to both sets of data can
be found in the pages of this issue of Science, in the comparative analysis by Olivier et al. (p. 1298).

Although we have made the point before, it is worth repeating that the sequencing of the human
genome represents, not an ending, but the beginning of a new approach to biology. As Galas says in
his Viewpoint (p. 1257), the knowledge that all of the genetic components of any process can be
identified will give extraordinary new power to scientists. Because of this breakthrough, research
can evolve from analyzing the effects of individual genes to a more integrated view that examines
whole ensembles of genes as they interact to form a living human being. Several articles in this issue
highlight how this approach is already beginning to revolutionize the way we look at human disease.

This has been a massive project, on a scale unparalleled in the history of biology, but of course
it has built on the scientific insights of centuries of investigators. By coincidence, this landmark
announcement falls during the week of the anniversary of the birth of Charles Darwin. Darwin's
message that the survival of a species can depend on its ability to evolve in the face of change is
peculiarly pertinent to discussions that have gone on in the past year over access to the Celera data.
(Full information regarding the agreements that were reached to make the data available can be
found at www.sciencemag.org/feature/data/announcement/gsp.shl.) We are willing to be flexible in
allowing data repositories other than the traditional GenBank, while insisting on access to all the
data needed to verify conclusions. In this domain, change is everywhere: Commercial researchers
are producing more and more potentially valuable sequences, yet (at least in the United States)
laws governing databases provide scant protection against piracy. Had the Celera data been kept
secret, it would have been a serious loss to the scientific community. We hope that our adaptability in
the face of change will enable other proprietary data to be published after peer review, in a way that
satisfies our continuing commitment to full access.

It should be no surprise that an achievement so stunning, and so carefully watched, has created
new challenges for the scientific venture. Science is proud to have played a role in bringing this
discovery onto the public stage. It is literally true that this is a historic moment for the scientific
endeavor. The human genome has been called the Book of Life. Rather, it is a library, in which, with
rules that encourage exploration and reward creativity, we can find many of the books that will
help define us and our place in the great tapestry of life.



Barbara R. Jasny and Donald Kennedy




16 FEBRUARY 2001 




1153 



EXHIBIT M 



Query= SEQ ID NO:1
         (1923 letters)

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

AL356421.10.1.170532                                              2811  0.0

>AL356421.10.1.170532
          Length = 170532

 Score = 2811 bits (1418), Expect = 0.0
 Identities = 1424/1426 (99%)
 Strand = Plus / Plus

Query: 498   gcagagttacagcaccatagccaaccacattcttaacagcaaaagcatctccaactggac 557
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 56950 gcagagttacagcaccatagccaaccacattcttaacagcaaaagcatctccaactggac 57009

Query: 558   tttcattcctgacagaaacagcagctatatcctgctacattcagtcaactcctttgcaag 617
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57010 tttcattcctgacagaaacagcagctatatcctgctacattcagtcaactcctttgcaag 57069

Query: 618   aaggctattcatagataaacatcctgttgacatatcagatgtcttcattcatactatggg 677
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57070 aaggctattcatagataaacatcctgttgacatatcagatgtcttcattcatactatggg 57129

Query: 678   caccaccatatctggagataacattggaaaaaatttcactttttctatgagaattaatga 737
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57130 caccaccatatctggagataacattggaaaaaatttcactttttctatgagaattaatga 57189

Query: 738   taccagcaatgaagtcactgggagagtgttgatcagcagagatgaacttcggaaggtgcc 797
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57190 taccagcaatgaagtcactgggagagtgttgatcagcagagatgaacttcggaaggtgcc 57249

Query: 798   ttccccttctcaggtcatcagcattgcatttccaactattggggctattttggaagccag 857
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57250 ttccccttctcaggtcatcagcattgcatttccaactattggggctattttggaagccag 57309

Query: 858   tcttttggaaaatgttactgtaaatgggcttgtcctgtctgccattttgcccaaggaact 917
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57310 tcttttggaaaatgttactgtaaatgggcttgtcctgtctgccattttgcccaaggaact 57369

Query: 918   taaaagaatctcactgatttttgaaaagatcagcaagtcagaggagaggaggacacagtg 977
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57370 taaaagaatctcactgatttttgaaaagatcagcaagtcagaggagaggaggacacagtg 57429

Query: 978   tgttggctggcactctgtggagaacagatgggaccagcaggcctgcaaaatgattcaaga 1037
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57430 tgttggctggcactctgtggagaacagatgggaccagcaggcctgcaaaatgattcaaga 57489

Query: 1038  aaactcccagcaagctgtttgcaaatgtaggccaagtgaattgtttacctctttctcaat 1097
             ||||||||||||||||||||||||||||||||||||  ||||||||||||||||||||||
Sbjct: 57490 aaactcccagcaagctgtttgcaaatgtaggccaagcaaattgtttacctctttctcaat 57549

Query: 1098  tcttatgtcacctcacatcttagagagtctgattctgacttacatcacatatgtaggcct 1157
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 57550 tcttatgtcacctcacatcttagagagtctgattctgacttacatcacatatgtaggcct 57609



Query: 1158 gggcatttctatttgcagcctgatcctttgcttgtccattgaggtcctagtctggagcca 1217 

1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 i M M 1 1 1 1 1 1 1 1 M M 1 1 1 1 1 1 1 

Sbjct : 57610 gggcatttctatttgcagcctgatcctttgcttgtccattgaggtcctagtctggagcca 57669 

Query: 1218 agtgacaaagacagagatcacctatttacgccatgtgtgcattgttaacattgcagccac 1277 

1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 N M M 1 1 M M 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 1 1 

Sbjct : 57670 agtgacaaagacagagatcacctatttacgccatgtgtgcattgttaacattgcagccac 57729 

Query: 1278 tttgctgatggcagatgtgtggttcattgtggcttcctttcttagtggcccaataacaca 1337 

IIIIMIIIIIIIIIIMIIIIIIIIIIIMIIillilillllllllllillllllMII 

Sbjct : 57730 tttgctgatggcagatgtgtggttcattgtggcttcctttcttagtggcccaataacaca 57789 

Query: 1338 ccacaagggatgtgtggcagccacattttttgttcatttcttttacctttctgtattttt 1397 

IIIMIIIIIIIillllllilllllillllilllllllllllllllllllllMlillll 

Sbjct: 57790 ccacaagggatgtgtggcagccacattttttgttcatttcttttacctttctgtattttt 57849 

Queiry: 1398 ctggatgcttgccaaggcactccttatcctc tatggaatcatgattgttttccatacctt 1457 

llllllllllilllllllllllllllllllllllllllllilllllMIIIIIIIIIIII 

Sbjct: 57850 ctggatgcttgccaaggcactccttatcctctatggaatcatgattgttttccatacctt 57909 

Query: 1458 gcccaagtcagtcctggtggcatctctgttttcagtgggctatggatgccctttggccat 1517 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 M N M 1 1 1 1 i 1 1 1 1 

Sbjct : 57910 gcccaagtcagtcctggtggcatctctgttttcagtgggctatggatgccctttggccat 57969 

Query: 1518 tgctgccatcactgttgctgccactgaacctggcaaaggctatctacgacctgagatctg 1577 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 

Sbjct : 57970 tgctgccatcactgttgctgccactgaacctggcaaaggctatctacgacctgagatctg 58029 



Query: 1578 ctggctcaactgggacatgaccaaagccctcctggccttcgtgatcccagctttggccat 1637 

lllllllllllllllllllilllllilllllillMIMIIIIIIillllllllllllll 

Sbjct : 58030 ctggctcaactgggacatgaccaaagccctcctggccttcgtgatcccagctttggccat 58089 



Query: 
Sbjct : 



1638 
58090 



cgtggtagtaaacctgatcacagtcacactggtgattgtcaagacccaTgcgagctgccat 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

cgtggtagtaaacctgatcacagtcacactggtgattgtcaagacccagcgagctgccat 



1697 . 
58149^ 



Query: 1698 tggcaattccatgttccaggaagtgagagccattgtgagaatcagcaagaacatcgccat 17 57 

IIIIIIIIIII.IIIIIMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIilllllllMII 

Sbjct : 58150 tggcaattccatgttccaggaagtgagagccattgtgagaatcagcaagaacatcgccat 58209 
Query: 1758 cctcacaccacttctgggactgacctggggatttggagtagccactgtcatcgatgacag 1817 

lllllllillllllllMIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIMIIIII 

Sbjct: 58210 cctcacaccacttctgggactgacctggggatttggagtagccactgtcatcgatgacag 58269 
Query: 1818 atccctggccttccacattatcttctccctgctcaatgcattccaggtaagtccagatgc 1877 

lllllllilMlillllllllllllllllllMMIIIIIIIIIIIIIIIIIIIIIIIII 

Sbjct: 58270 atccctggccttccacattatcttctccctgctcaatgcattccaggtaagtccagatgc 58329 
Query: 1878 ttctgaccaagtgcaaagtgagagaattcatgaagatgttctgtga 1923 

IIIIIIIIIIIIIIIIIIIIIIIIIIMIIIIIIIIIIMIIIIM 

Sbjct : 58330 ttctgaccaagtgcaaagtgagagaattcatgaagatgttctgtga 58375 



Score = 468 bits (236), Expect = e-129 
Identities = 236/236 (100%) 
Strand = Plus / Plus 



Query: 266   aggaggattcacgtttggttcagccatttgaagacaatataaaaataagtgtatatactg 325
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 55758 aggaggattcacgtttggttcagccatttgaagacaatataaaaataagtgtatatactg 55817

Query: 326   gaaagtctgagaccataacagatatgttgctacaaaagtgtcccacagatctgtcttgtg 385
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 55818 gaaagtctgagaccataacagatatgttgctacaaaagtgtcccacagatctgtcttgtg 55877

Query: 386   taattagaaacattcagcagtctccctggataccaggaaacattgccgtaattgtgcagc 445
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 55878 taattagaaacattcagcagtctccctggataccaggaaacattgccgtaattgtgcagc 55937

Query: 446   tcttacacaacatatcaacagcaatatggacaggtgttgatgaggcaaagatgcag 501
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 55938 tcttacacaacatatcaacagcaatatggacaggtgttgatgaggcaaagatgcag 55993



Score = 297 bits (150), Expect = 1e-77 
Identities = 153/154 (99%) 
Strand = Plus / Plus 



Query: 115   ggtgtatgcgatggtgtctgtacagactacccccagtgtactcaaccttgccctccagac 174
             |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
Sbjct: 54671 ggtgtatgcgatggtgtctgtacagactactcccagtgtactcaaccttgccctccagac 54730

Query: 175   actcagggaaatatggggttttcatgcaggcaaaagacatggcacaagatcactgacacc 234
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 54731 actcagggaaatatggggttttcatgcaggcaaaagacatggcacaagatcactgacacc 54790

Query: 235   tgccagactcttaatgccctcaacatctttgagg 268
             ||||||||||||||||||||||||||||||||||
Sbjct: 54791 tgccagactcttaatgccctcaacatctttgagg 54824



Score = 145 bits (73), Expect = 1e-31 
Identities = 73/73 (100%) 
Strand = Plus / Plus 



Query: 1     atgactcatatacttttgctgtactacttggtgtttcttttgcccacagagtcctgtagg 60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 49109 atgactcatatacttttgctgtactacttggtgtttcttttgcccacagagtcctgtagg 49168

Query: 61    acattgtatcagg 73
             |||||||||||||
Sbjct: 49169 acattgtatcagg 49181



Score = 93.7 bits (47), Expect = 4e-16 
Identities = 47/47 (100%) 
Strand = Plus / Plus 



Query: 71    aggctgcaagcaaaagcaaggagaaggtgcctgccaggccacacggt 117
             |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 53530 aggctgcaagcaaaagcaaggagaaggtgcctgccaggccacacggt 53576
LOCUS       AL356421             170532 bp    DNA     linear   PRI 02-DEC-2004
DEFINITION  Human DNA sequence from clone RP11-550C4 on chromosome 6 Contains
            the 3' end of the CD2AP gene for CD2-associated protein, the GPR111
            gene for G protein-coupled receptor 111, the GPR115 gene for G
            protein-coupled receptor 115, the 5' end of a gene for novel
            protein similar to seven transmembrane receptor (Rhodopsin family),
            a novel gene and a ribosomal protein 27A (RPL27A) pseudogene,
            complete sequence.
ACCESSION   AL356421
VERSION     AL356421.10  GI:10443437
KEYWORDS    HTG; CD2AP; GPR111; GPR115; rhodopsin; RPL27A; secretin; seven
            transmembrane.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 170532)
  AUTHORS   Corby,N.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-NOV-2004) Wellcome Trust Sanger Institute, Hinxton,
            Cambridgeshire, CB10 1SA, UK. E-mail enquiries: vega@sanger.ac.uk
            Clone requests: clonerequest@sanger.ac.uk
COMMENT     On Oct 1, 2000 this sequence version replaced gi:1018*6530.
            The following abbreviations are used to associate primary accession
            numbers given in the feature table with their source databases:
            Em:, EMBL; Sw:, SWISSPROT; Tr:, TREMBL; Wp:, WORMPEP; Information
            on the WORMPEP database can be found at
            http://www.sanger.ac.uk/Projects/C_elegans/wormpep This sequence
            was generated from part of bacterial clone contigs of human
            chromosome 6, constructed by the Sanger Centre Chromosome 6 Mapping
            Group. Further information can be found at
            http://www.sanger.ac.uk/HGP/Chr6

            Genome Center
            Center: Wellcome Trust Sanger Institute
            Center code: SC
            Web site: http://www.sanger.ac.uk
            Contact: vega@sanger.ac.uk

            This sequence was finished as follows unless otherwise noted: all
            regions were either double-stranded or sequenced with an alternate
            chemistry or covered by high quality data (i.e., phred quality >=
            30); an attempt was made to resolve all sequencing problems, such
            as compressions and repeats; all regions were covered by at least
            one subclone; and the assembly was confirmed by restriction digest,
            except on the rare occasion of the clone being a YAC.
            RP11-550C4 is from the library RPCI-11.2 constructed by the group



http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=10443437 